|Home | About | Journals | Submit | Contact Us | Français|
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Protein kinases are critical to cellular signalling and post-translational gene regulation, but their biological substrates are difficult to identify. We show that cyclin-dependent kinase (CDK) consensus motifs are frequently clustered in CDK substrate proteins. Based on this, we introduce a new computational strategy to predict the targets of CDKs and use it to identify new biologically interesting candidates. Our data suggest that regulatory modules may exist in protein sequence as clusters of short sequence motifs.
Protein kinases are ubiquitous components of cellular signalling networks . A relatively well understood example is the network that controls progression of the cell cycle, where cyclin-dependent kinases (CDKs) couple with various cyclins over the cell cycle to regulate critical processes [2-4]. Despite their biological and medical importance, relatively few direct, in vivo targets of these kinases have been identified conclusively, because experimental techniques are difficult and time consuming [1,5]. With the availability of databases of protein sequences, computational methods provide an alternative approach [6,7].
Kinase substrates often have short, degenerate sequence motifs surrounding the phosphorylated residue . Putative target residues can be predicted by searching for matches to the consensus for a particular kinase. For example, CDK substrates often contain S/T-P-X-R/K where X represents any amino acid, and S/T represents the phosphorylated serine or threonine [9,10]. Because of the low specificity of the CDK consensus, however, databases of protein sequences are expected to contain large numbers of matches by chance. Therefore, many of the matches in protein sequences are likely to be false-positive predictions. Consistent with this, when 553 Saccharomyces cerevisiae proteins with at least one match to the CDK consensus were tested in a high-throughput kinase assay, only 32% (178) were found to be substrates . Furthermore, in some cases characterized CDK substrates are phosphorylated at residues matching only a minimal consensus S/T-P ; considering these weak matches would probably lead to even larger numbers of false positives.
Characterized CDK targets may be phosphorylated at multiple residues (for instance, see the report by Lees and coworkers ). Recent studies of several CDK target proteins in S. cerevisiae have shown that these multiple phosphorylations can regulate stability , protein interaction [14,15], or localization . Motivated by these observations, we propose an alternative computational strategy to identify substrates of CDKs; instead of attempting to predict individual phosphorylation sites, we search for proteins that contain high densities of strong and weak consensus matches that are closely spaced in the primary amino acid sequence. (We refer to this close spacing as 'clustering', and this should not be confused with clustering of multivariate data.)
Taking advantage of the results of a high-throughput study , we show statistically that CDK1 targets in S. cerevisiae contain multiple closely spaced consensus matches and we develop computational methods to identify such proteins. We also find that these clusters tend to occur in disordered or unfolded regions near the termini of the protein. We show that it is possible to predict proteins that are likely to be targets of CDKs in S. cerevisiae by searching for proteins that contain clustered matches to the CDK consensus. We also show that human CDK targets are enriched for proteins that contain clustered consensus matches and, by searching human cell cycle genes, we predict several putative CDK targets, including the human orthologs of Schizosaccharomyces pombe CDC5 (CDC5L) and S. cerevisiae Cdc20p (CDC20). Finally, we examine co-clustering of the CDK consensus motifs with the 'cy' or RXL motif , which is known to be important in determining which CDK-cyclin complex will phosphorylate a given substrate.
CDK substrates in S. cererevisiae are often phosphorylated at multiple serine or threonine residues, some of which match the full (henceforth 'strong') consensus S/T-P-X-R/K, whereas others match a minimal (henceforth 'weak') consensus S/T-P. For example, the amino-terminal region of Cdc6p (Figure (Figure1b)1b) is a direct target of Cdk1p (also known as Cdc28p) , and contains three strong and one weak CDK consensus. In order to test whether these observations could be used to predict new substrates, we first compared the number of matches of each motif per residue in a set of 12 Cdk1p targets known from low-throughput biochemical and genetic experiments (compiled by Ubersax and coworkers ; henceforth referred to as 'known' targets; see Table Table11 and Figure Figure1a)1a) with the number in the genome. We find a highly significant, more than ninefold enrichment of the strong consensus (Figure (Figure2a,2a, left side) but not for a scrambled version (P-R/K-X-S/T) of the consensus (Figure (Figure2a,2a, right side), indicating that the enrichment is not due to simple compositional effects. For the weak consensus (after masking the strong consensus), we also find enrichment over the genome and not for a scrambled consensus (after masking the weak and strong consensus), but it is less striking (less than twofold; Figure Figure2b2b).
Because we were concerned that the discovery of the known targets may have been biased by the observation that they contained many matches to the strong consensus, we also computed these frequencies for the 18 proteins out of a set of 198 randomly chosen genes from S. cerevisiae identified as Cdk1p targets in a high-throughput assay  (henceforth referred to as 'unbiased positives'; see Table Table1).1). We found similar results in this unbiased positive set, although the enrichment of strong matches was just under fourfold in this case and the enrichment of weak matches was less than 1.5-fold (Figure (Figure2).2). That the fold enrichment is somewhat less in this set is consistent with some of the enrichment in the known set being due to bias in their discovery, but also with some false-positive findings being picked up in the kinase assay. Nevertheless, this rules out the possibility that the enrichment of matches in bona fide CDK substrates is only the result of a bias.
Examination of phosphorylated residues in CDK target proteins reveals that they are often found 'clustered' in one region of the primary amino acid sequence (Figure (Figure1).1). We sought to test whether this apparent clustering was due simply to a uniform overall enrichment of consensus matches in these proteins, or whether it was a preference for the consensus matches to occur near each other. We modeled the number of residues until a strong or weak match was identified using a bivariate geometric distribution (see Materials and methods, below). We then performed a likelihood ratio test (LRT) between the hypothesis (H1) that the spacings were drawn from a mixture of a high-density 'cluster' component and a low-density 'background' component, and the hypothesis (H0) that the spacings were simply the result of a single uniform density component (Figure (Figure3).3). In order to compare these models, we maximized the likelihood under each hypothesis using expectation-maximization (EM)  (see Materials and methods, below) and computed the likelihood ratio statistic:
Where data represents the observed spacings and corresponding (strong or weak) consensus matches. Because H0 corresponds to the case of H1 with the parameters of the two components constrained to be equal, we expect the LRT statistic (Λ) to be χ2 distributed with three degrees of freedom (see Materials and methods, below).
We therefore computed the P values for the LRT on the known targets, the set of 'unbiased positives', the remaining randomly chosen proteins that were found not to be targets of Cdk1p in the assay  (henceforth referred to as 'unbiased negatives'; see Table Table1),1), and the 'known' targets using the scrambled consensus sequences (Table (Table2).2). Consistent with the model that bona fide targets contain clusters of consensus matches, rather than a simple overall enrichment, we could reject the overall enrichment hypothesis in the first two tests (P = 1.2 × 10-9 and P = 1.6 × 10-4, respectively), but not in the latter two negative controls (P = 0.13 and P = 0.15, respectively; see Table Table22).
Having established statistical enrichment and tendency for consensus matches to cluster in the primary sequence of bona fide CDK targets, we developed a method to predict CDK targets based on these properties. For each protein, we sought to compare the likelihood of the observed matches and spacings given the genome frequencies (Hbg) with the likelihood under a two-component model (Hc), in which one component is the background genome model and the other is high-frequency 'cluster' component whose parameters are estimated from the protein of interest. This suggests ranking genes according to the following:
Because the weak CDK consensus matches the specificity of any proline-directed kinase, we were concerned that some of our predictions would not be specific to CDKs. In order to rule out these cases, we defined a 'nonspecific' model (Hns) as above, except that the frequency of strong matches in the high-frequency 'cluster' component was constrained to be less than or equal to the background genome frequency. We optimized the likelihood under each of these models for each protein (see Materials and methods, below) and ranked them by a classifier assuming uniform 'priors' over the various models:
This will assign lower scores to proteins that have clusters of only weak consensus matches. Cdc6p (Figure (Figure1a),1a), for example, has SLR = 7.28, and ranks 22nd in the genome.
The mixture models we have employed thus far do not assume that the closely spaced matches fall in a single contiguous region of the primary sequence. We considered this appropriate because residues may be adjacent in the structure of the protein but not in the primary sequence. Nevertheless, we were also interested in identifying the continuous subregions of proteins that contain high densities of matches, such as the amino-terminal domain of Cdc6p (Figure (Figure1b).1b). We therefore also developed a method to identify the most significant 'cluster' of matches within each protein. While SLR (described above) measures 'clustering' in the whole protein, this method allows identification of a single optimal 'cluster'. This represents an alternate strategy to predict proteins that contain clusters of consensus matches - by explicitly identifying the clusters. We note that this does assume that the clustered matches occur in a contiguous region, and therefore, for example, in the case of Cdc6p (Figure (Figure1a)1a) the carboxyl-terminal matches would not contribute to the score.
To find optimal clusters, we counted the number of matches (n) to the strong (s) or weak (w) consensus in each possible subregion of the protein of length l. We then computed the probability of observing as many matches or more of each type using the binomial distribution, and combined these P values by multiplying them together by assigning a P value to their product using the Q-fast algorithm . We note that the subregion with the maximal score will begin and end with a match. There are therefore only N(N - 1)/2 possible clusters to try, where N (= ns + nw) is the total number of matches in the entire protein. This means that proteins with many matches have more chances to obtain a high scoring cluster. We therefore correct for the total number of clusters searched by multiplying the P value by this factor (a Bonferoni multiple testing correction). Thus, we define the following:
where Q [...] is the Q-fast algorithm, p(≥ x | l, f) is the binomial probability of observing x or more in l tries when the per try probability is f, and fsb and fwb are the per residue probabilities of observing strong and weak matches, respectively, in the genome. Once again we were concerned about the possibility of nonspecific clusters and therefore, when using SBN to predict CDK targets, we imposed the following heuristic; to be considered, subregions must contain at least one match to the strong consensus per 100 residues. For example, in the case of Cdc6p, this optimal cluster corresponds to the amino-terminal domain (Figure (Figure1b,1b, bold residues) and has SBN = 8.38, ranking 61st in the genome.
In order to assess whether these classifiers were capturing useful information about the recognition of substrates by CDKs, we computed the scores described above for each protein in S. cerevisiae and compared them to the 'phosphorylation scores' reported for the 695 S. cerevisiae proteins tested in the high-throughput Cdk1p assay  (Table (Table1).1). These proteins tested in that study fall into three groups: 198 randomly chosen proteins (containing the 'unbiased positives' and 'unbiased negatives' described above, henceforth referred to as 'unbiased'), all 385 S. cerevisiae proteins that contain two or more matches to the strong CDK consensus (henceforth '2+'), and finally 137 proteins that contain one match to the strong consensus, and exhibit cell cycle transcript regulation (henceforth '1cc'). We note that although the last two groups were biased in different ways, as long as we treat them separately (condition on the bias) the proteins in each group can be treated as identical and independently distributed.
In the 'unbiased' and '2+' groups, we found a highly significant correlation (R > 0.3, P < 10-10) between the phosphorylation score in the assay and both of the cluster-based scores described above (Table (Table3),3), such that proteins with higher scoring cluster are more likely to have high scores in the kinase assay.
Because in many cases we noted that the clusters seemed to occur near the carboxyl- or amino-terminus of the proteins (as in the case of the Cdc6p amino-terminal domain; Figure Figure1),1), we computed the relative 'position' of the optimal cluster, where 0.5 is the midpoint of the protein and 0 is either terminus (see Materials and methods, below). Interestingly, we found that the position was negatively correlated (R < -0.2, P < 0.01), with the results of the kinase assay in the same two groups of targets, such that proteins with clusters near their termini were more likely to be positive in the assay. It has also been noted that phosphorylation sites tend to fall in disordered or unfolded regions of proteins . Consistent with this, we found a significant correlation (R ≤ -0.19, P < 0.01) between the 'foldedness' [21,22] of the cluster and the score in the kinase assay, such that proteins containing clusters of matches in unfolded regions were more likely to be bona fide substrates. In order to verify that these factors were independently correlated with the results of the assay (and not simply correlated with each other), we fit linear models of the likelihood ratio score, position and 'foldedness', and found that they all contributed significantly (P < 0.02; Table Table33).
The correlations we observed suggested that clustering of consensus matches could be used to predict the targets of Cdk1p in S. cerevisiae. Taking proteins defined as CDK targets or not in the high-throughput assay  as positives and negatives, we computed receiver operating characteristic (ROC) curves for the three groups of proteins tested in the assay.
First, we compared the two classifiers described above to simply classifying based on the density of strong CDK matches in the protein. We found that although all were strong classifiers in the 'unbiased' set, the cluster-based methods performed better than a simple density (Figure (Figure4a).4a). In the low false-positive range, which is of most relevance to protein database searches, the score based on the likelihood ratio (SLR) seemed most effective. We also compared the methods on the '2+' set and found similar results (data not shown). We therefore used SLR for subsequent analyses.
We next compared the predictive power of the cluster-based classifier (SLR) with that of a specificity matrix-based approach (Scansite ), and used the score of the best match to the Cdc2 matrix in each protein (see Materials and methods, below) as the predictor. Both our cluster-based method and the specificity matrix-based method were strong classifiers for the 'unbiased' set (Figure (Figure4b);4b); since most of these proteins contain no matches, many of the negatives can be ruled out simply based on the absence of a match to the consensus. For the '1cc' proteins, neither method has much power (Figure (Figure4d).4d). For the '2+' set (Figure (Figure4c),4c), however, we notice a considerable increase in sensitivity and specificity in the low false-positive region by using our cluster score. In the '2+' group, at false-positive levels near 5%, the matrix-based method performs similar to a random classifier, whereas the cluster-based method retains some power. Because each of these proteins has multiple matches to the consensus, most have high matrix match scores. The proteins in which there are multiple matches that are spatially clustered, however, are more likely to be bona fide substrates for Cdk1p. We note that even in this set the overall predictive power is still relatively poor.
An important feature of these cluster based methods is that we can include weak matches to the consensus in our predictor. We found, however, that classifiers based on clustering only of strong matches also performed well (data not shown). In order to confirm that the weak matches were contributing to the clusters, we identified optimal clusters based only on the strong matches using a univariate version of the method described above (SBN). We then compared the density of weak matches in these regions with the density of the scrambled weak consensus. We found enrichment of 2.1-fold and 1.4-fold in the 'known' targets and assay positives (all groups combined), as compared with 1.2-fold in the negatives (all groups combined; Figure Figure5),5), indicating that weak matches are preferentially associated with clusters of strong matches. The size of these effects is not great, however, and therefore weak matches may not contribute much to the classification of individual proteins. Nevertheless, this supports the use of both the strong and weak consensus matches in this case, and is consistent with previous reports that weak sites can be important for function  (see Discussion, below).
Our aim here was not to explore the properties of these classifiers in detail, but rather to establish the potential of methods that take advantage of the propensity of the CDK motifs to cluster (see Discussion, below).
Taken together, these results suggest that not all Cdk1p targets in S. cerevisiae contain clusters of consensus matches, but that there is some subset that can be predicted in this way. In order to estimate the number of CDK consensus cluster containing proteins that can be recognized based on sequence alone, we searched the genome for matches to scrambled versions of the strong and weak CDK consensus (P-R/K-X-S/T and P-S/T, respectively) and compared the distribution of likelihood ratio scores with those obtained using the real consensus sequences. Comparison of these distributions suggests a score threshold of 3.5 (Figure (Figure6).6). This yields an excess of 50 proteins, because there are 67 proteins above the threshold when the real consensus sequences are used, and 17 when scrambled consensus sequences are used.
Of these 67 top predictions (ranked based only on sequence), 49 were positive in the kinase assay  (all groups combined). This indicates at this threshold our cluster-based method yields a positive predictive value (PPV) of 73%, but it includes 18 false positives. Compared with the PPV of 49% (17/35) for the proteins identified by the matrix-based approach (Scansite ) at the same false-positive level, our cluster-based approach has significantly greater PPV (P = 0.017, by Fisher's exact test), which is consistent with the hypothesis that searching for clusters can strongly identify at least some subset of CDK targets. In order to examine further the properties of the clustered matches in these proteins, we identified the maximal scoring cluster using the method described above (SBN). Consistent with our earlier observations, we found that for 36% (24/67) of these proteins the optimal cluster ended within 5% of the protein's length from either terminus, and that even if we masked the CDK matches, the optimal clusters were on average significantly less 'folded' that the whole proteins (-0.08 versus -0.0002, respectively; P < 0.001, by Students' t-test).
Regulation of cell cycle progression by CDKs is thought to be an ancient feature of eukaryotic cells. Indeed, human CDK homologs were first identified based on their ability to rescue yeast mutants [24,25]. We therefore sought to test whether clustering of consensus matches could also be used to predict CDK targets in humans.
We found 73 human proteins (see Materials and methods, below) that were listed as CDK, CDK1, or CDK2 targets in the phosphoELM database . Although we do not have a set of negative proteins (as for S. cerevisiae), we can still compute an ROC curve by using the fraction of the genome above the threshold as an approximate false-positive rate. In doing so we assume that the fraction of proteins that are targets in the genome is negligible compared with the total number of proteins. This analysis (Figure (Figure7a)7a) suggests that our method has some predictive power at reasonably low false-positive levels; some subset of human CDK targets may also contain clusters of consensus matches and may therefore be predicted using our method.
To predict novel human CDK targets, we obtained a set of 112 human cell cycle genes (see Materials and methods) and identified those containing clustered consensus matches. Of the six proteins in this set with clusters scoring 3.5 or greater (Figure (Figure7b),7b), none were included in the 73 CDK targets in phosphoELM. Of these, BRCA2 was recently shown to be a CDK target . Of the other five, there is already evidence that three (RANBP2, CDC20, and CDC5L) are mitotic phosphoproteins, and there are varying degrees of evidence that they are bona fide CDK targets [28-30]. The other two (CDCA5/sororin and TPX2) are both degraded by the anaphase-promoting complex through direct interaction with K-E-N motifs [31,32]. Interestingly, these K-E-N motifs are found among closely spaced CDK consensus matches in these proteins (Figure 7c,d). It is tempting to speculate that their anaphase promoting complex-dependent degradation is regulated through phosphorylation by CDKs, as has been suggested for human CDC6 , and that these clusters represent regulatory modules (see Discussion, below). Regardless, that these human cell cycle proteins contain clusters of CDK consensus sequences, and that there is some evidence for CDK phophorylation for four of the six, suggests that cluster-based methods can be used to predict CDK targets among human proteins as well.
CDKs are thought to gain target specificity by pairing with particular cyclins. For example, Cdc6p was found to be a specific target of Cdk1p:Clb5p  and contains cyclin specific 'cy' motifs (R/K-X-L ) in addition to CDK motifs (Figure (Figure1b,1b, filled bars). We noted that of 14 Cdk1p:Clb5p specific targets identified in a recent study , 72% (10) where among our strongest S. cerevisiae predictions (SLR > 3.5). Because, of the 143 proteins tested in that study, only 29% (42) were included in this set (SLR > 3.5), 72% represents a highly significant enrichment (P < 0.001, Fisher's exact test; Figure Figure8a,8a, left side). Interestingly, we also found that the clb5 specific proteins above our cutoff contained a higher proportion of strong matches to the CDK consensus; the clb5 specific clusters contained 43 strong and 18 weak matches (70% strong), which is significantly more than in the clusters in the rest of the proteins above the cutoff, where we find 217 strong and 343 weak (39% strong; P < 0.001, by Fisher's exact test; Figure Figure8a,8a, right side). We speculate that this may be related to the lower overall activity of the Cdk1p-Clb5p complex .
In order to test directly whether 'cy' motifs were associated with the CDK clusters, we masked out the matches to the CDK consensus and compared the frequency of matches to the cy motif in the clb5 specific proteins with the frequency in the rest of the proteins above the cutoff (Figure (Figure8b).8b). Although the frequency of cy motifs in the entire proteins was significantly greater in the clb5-specific targets than in the other proteins (Figure (Figure8b,8b, left side; P = 0.014, by Fisher's exact test), the difference was greater and more significant when we considered only the regions identified as optimal clusters (Figure (Figure8b,8b, right side; P < 0.001, by Fisher's exact test). Futhermore, we note that the regions defined as the optimal clusters in the proteins that were not clb5 specific contain fewer matches to this motif than expected based on the genome frequency, perhaps related to the paucity of leucine residues near phosphorylation sites . These findings suggest that cy motifs tend to cluster with CDK motifs in clb5 specific targets. Thus, it may be possible to associate cyclin specificity with a specific composition of motifs, analogous to the 'regulatory codes' that have been proposed for some enhancers of transcription  (see Discussion, below).
We divide the discussion into two sections, the first addressing biologic considerations and the second methodology.
Several characterized CDK target proteins have multiple consensus phosphorylation sites, often restricted to particular regions of the protein. We confirmed that known S. cerevisiae CDK targets are statistically enriched for CDK consensus matches (Figure (Figure2)2) and that these are closely spaced (clustered) in the linear sequence of these proteins (Figure (Figure33 and Table Table2).2). We showed that spatial clustering is significantly associated with bona fide CDK substrate proteins in S. cerevisiae (Table (Table3)3) and human (Figure (Figure7a),7a), and a search of human cell cycle genes suggested several plausible CDK targets, some of which already have various degrees of supporting evidence (Figure (Figure7b7b).
Noncoding regulatory DNA elements, such as enhancers (or cis-regulatory modules), often contain clusters of binding sites for transcription factors [36,37], and computational methods have been developed to exploit this . In analogy, we suggest that the regions of proteins containing the clusters of CDK consensus matches may be regarded as phospho-regulatory modules. As with cis-regulatory modules, they may contain additional regulatory elements, such as the phosphorylation sites of other kinases, localization and degradation signals, and other protein recognition motifs. For example, the amino-terminal domain of S. cerevisiae Cdc6 (Figure (Figure1b)1b) contains a cluster of CDK consensus matches, as well as a nuclear localization signal .
As an illustration of a potential mechanistic basis for this model, consider the case of clusters of phosphorylation sites proximal to nuclear localization signals (NLSs). Nuclear import is often mediated through the interaction of importins (or karyopherins) with NLSs, which are basic, hydrophilic motifs . The addition of multiple bulky, negatively charged phosphates proximal to these motifs have the potential inhibit their function; indeed, several examples of such inhibitory phosphorylation of nuclear localization signals have been described (for review, see Jans and Hubner ), including inhibition of the SV40 and Swi5p NLSs by phosphorylation of partially overlapping or proximal consensus sites by CDKs [41,42] (Figure (Figure1a).1a). We suggest that such local interactions between motifs may be a general mechanism by which regulatory specificity can be achieved.
Additionally, characterized cy motifs, such as the cy motif in Orc6p , often occur in proximity to clusters of phosphorylation sites (Figure (Figure1a).1a). We showed that the clusters of CDK matches in clb5-specific CDK targets contained a higher fraction of strong CDK consensus matches and an excess of cy motifs (Figure (Figure8).8). These observations are consistent with the lower activity of this CDK-cyclin combination  and its known dependence on the cy motif for substrate recognition . Although the cy motif is known to interact directly with cyclin, it is not mechanistically obvious why multiple copies of this motif would be proximal to the CDK sites in the linear amino acid sequence. Nevertheless, as has been suggested for control of transcriptional regulation , we suggest that these features form a regulatory 'grammar' or 'code' that produces a specific pattern of activity. To test this model further we also examined the order of strong and weak sites in the clusters, but we found no striking patterns (data not shown). We suggest that as more data become available, and computational methods for the prediction of short protein motifs continue to improve, it will be possible to identify increasingly specific combinations of motifs associated with particular patterns of regulation.
It is also important to note that there are additional possible biological explanations for the large numbers of CDK consensus matches, such as precise control of regulatory thresholds  or highly cooperative binding , and that because there are many determinants of kinase-substrate specificity, clustering of regulatory motifs cannot be expected to reveal all (or any) of the substrates of a particular kinase. The general utility of the methods proposed here will depend on the specific biology of each regulatory system.
Indeed, it is perhaps surprising that the proximity of features of the primary amino acid sequence are predictive at all, because these may not be reflected by the three-dimensional structure of the protein. Interestingly, we found that the clusters of CDK matches tend to fall in relatively disordered regions and near the termini of target proteins (Table (Table2);2); perhaps in these regions spatial proximity in the primary sequence is more reflective of proximity in vivo. We suggest that this is also consistent with the hypothesis of combinatorial regulation at these regions through multiple motifs; the regions containing these clusters must be accessible to multiple regulatory proteins, and therefore perhaps tend to be flexible, solvent-exposed regions, rather than highly structured domains.
One of the main limitations of this simplistic approach to predicting CDK substrates is that we cannot distinguish between targets of Cdk1p paired with other cyclins, other CDKs, or other regulatory proteins with similar specificity. This means that even our 'false positive' predictions may contain interesting biologic information despite failing to show phosphorylation by Cdk1p-Clb2p in the kinase assay. For example, among the 67 proteins that we predicted to be CDK targets in S. cerevisiae, 18 were not found to be targets in the high-thoughput assay . Included in these, we find CDK-cln substrates Ste20p and Whi5p [45,46], as well as Cdc14p phosphatase substrates Sli15p and Cdc15p . These examples further demonstrate that CDK consensus clusters often point to functionally important interactions; identifying sequence features that distinguish these different types of interactions is an area for further research.
Despite its simplicity, spatial clustering represents a potentially powerful new method for computationally identifying a subset of CDK targets. Unlike current methods, which predict individual phosphorylation sites, the statistical methods we have developed identify proteins that exhibit clustering of matches to the CDK consensus. In addition, cluster-based methods can incorporate weak matches to the consensus, and we showed that weak sites tend to co-cluster with strong sites in bona fide CDK substrates. Although we have not evaluated whether our method will be applicable to other systems, the methods we have described search for patterns of matches that deviate from a random (geometric) expectation. They can therefore be applied to searches for clusters of any highly degenerate sequence elements, without the need to specify a window size. We suggest, however, that as more is understood about the specific properties of clusters of protein motifs, it will be possible to use more powerful statistical methods to search for particular patterns and combinations of motifs.
A further possible application of spatial clustering of consensus motifs is to identify the functionally important residues in CDK targets. For example, in Cdc6p (Figure (Figure1),1), the 'clustered' matches correspond to the amino-terminal domain, which is phosphorylated and bound directly by CDKs . In long proteins where there are large numbers of consensus matches, we speculate that those in clusters are more likely to represent the critical phosphorylation sites. Finally, we suggest that cluster-based methods could be used in combination with the sophisticated specificity-based methods that are currently widely used , or combined with other evidence, such as structural properties , mass spectrometry data , or evolutionary conservation . Indeed, given the availability of genome sequences, comparative analysis of CDK clustering should yield insights into the evolution of post-translational regulation.
We obtained the human CDK targets from the phosphoELM database  by parsing the webpage CDK, CDK1, or CDK2 to obtain uniprot IDs and then looking these up in Ensembl. Thus, the 73 CDK targets may not include all of the CDK targets in phosphoELM. A list of 112 human genes with described cell cycle phenotypes in human cells was obtained from the Mitocheck database . The 73 known human CDK targets and 112 human cell cycle genes are available in Additional data file 1.
For a given protein sequence, we imagine counting the residues after we see a match until we observe another match to the consensus. Once one is observed, we record the number of residues and the type of match (strong or weak) that was observed. We denote the match as a vector X, such that a strong match is represented as X = (1,0) and a weak match as (0,1). If matches occur independently, then the probability of any particular match is given by the following.
Where m indexes the types of matches (strong or weak), fm is the per residue probability of observing a match of type m, is the probability of any match, and l is the total number of residues observed. We note that this defines a multivariate geometric distribution and we can use the geometric series () to show the following:
This, as expected, is fs/(fs+ fw) or fw/(fs+ fw), for a strong or weak match, respectively. The distribution of spacings, on the other hand, is given by another geometric distribution:
In order to test the hypothesis that the matches are clustered relative to what would be expected under a single frequency for each class of match, we defined a two-component mixture of multivariate geometric distributions, such that:
where f1 and f2 are the parameters for the two components, ϕ 1 and ϕ 2 are their sums, and π is a mixing parameter or prior probability on which component will be observed. The likelihood of the data under this model is simply the product of terms of this form over all of the observed matches and spacings. In order to obtain maximum likelihood estimates of these parameters, we derived EM  update equations as follows. The expected complete log-likelihood of the data is:
Where i indexes the observation, c indexes the two components, angled brackets indicate expectations, and Zci is an unobserved indicator variable that denotes - for each observed match and spacing - which component of the mixture it was drawn from. The maximization step entails setting the derivatives with respect to the parameters to zero, under the constraint that . This yields the following:
where N is the total number of matches in the dataset respectively. At the expectation step we use Bayes' theorem to calculate the following, using the estimates of the parameters (f, π) from the previous iteration:
We used an implementation of this algorithm to maximize the likelihood and obtain estimates of the parameters. For a set of input matches (characterized by spacing, li, and type, Xi) at convergence, the algorithm produces a set of maximum likelihood estimates of the parameters (f, π) with which we compute the maximum value of the likelihood. In order to verify that the EM optimization was reliably finding the maximum in the likelihood (it is only guaranteed to give a local maximum), we re-ran the fit using five random sets of initial parameters.
In practice, protein sequences deviate from these geometric models in that they do not in general end with a match to the motif. In order to account correctly for the remaining residues after the final match, we combine them with the residues before the first match. This also ensures that a given set of matches has the same probability regardless of where it occurs in the protein (relative to the start). Another technical issue with the application of geometric models to proteins is that the decision to begin 'counting' the residues from the amino-terminus or 'left' end is arbitrary; we could equally well have started from the carboxyl-terminus or 'right' end. We confirmed that this makes little difference; counting from 'right' to 'left' gave qualitatively very similar results.
To use these geometric models for hypothesis testing we proceed as follows. The single-component multivariate geometric has two parameters (the densities of strong and weak matches, f), whereas the two-component mixture has five (two sets of densities, f1 and f2, and a mixing parameter π). We note that these models are nested; the single-component model (H0) is a two-component mixture where the parameters for the two components are constrained to be equal (f1 = f2). Because the likelihood in the single-component case (H0) is independent of the mixing parameter, π, there is a three-parameter difference between the two hypotheses. We therefore expect the distribution of the likelihood ratio statistic to be χ2 with three degrees of freedom . To verify that the distribution of the likelihood ratio statistic was indeed χ2 with three degrees of freedom, we randomly permuted the positions of the consensus matches in the 'known' set 100 times, and computed the likelihood ratio statistic for the comparison of the two models; we found reasonable agreement with expectation (data not shown).
We compute SLR for each protein as follows. For each protein, we obtain the set of matches (their positions and type, strong or weak) and compute the likelihood under the following: Hbg (assuming the matches were randomly drawn from the genome frequencies); Hc (fitting the mixture using the EM algorithm described above, but keeping the background component set to the genome frequencies); or Hns (as for Hc, but additionally constraining the frequency of strong matches in the cluster component to be less than or equal to the background frequency). We combine these likelihoods as is given in Results (above). As before, we run the EM with five random starting points for each protein.
We identified the optimal cluster using SBN as described above. To compute the position of the cluster, we calculated the distance between the start of the protein and the start of the cluster, and between the end of the cluster and the end of the protein. We then took the minimum of these divided by the length of the protein to be the position. We computed the foldedness as If = 2.785 × H - |R| - 1.51 [21,22], where H is the average hydropathy  per residue and |R| is the absolute value of the charge (at pH 7.0) per residue in the cluster.
We submitted the yeast protein sequences to the batch Scansite  using low stringency, which yielded 12,134 Cdc2 matches in 4,048 of the 5,889 yeast proteins. We then took the best (lowest score) for each of those proteins.
The following additional data are available with the online version of this paper. Additional data file 1 contains the S. cerevisiae proteins with SLR above 3.5, and the human CDK targets and cell-cycle proteins with associated SLR scores. Scripts to calculate SLR and SBN are available on AMM's website .
This document contains the Saccharomyces cerevisiae proteins with SLR greater than 3.5, and the human CDK targets and cell cycle proteins with associated SLR scores.
We gratefully acknowledge Dr David Morgan for providing data and for stimulating discussions. We thank Dr José Jimenéz for discussions and assistance with online resources, Dr Seth Grant for interesting discussions, Dr Noboru Komiyama for help with protein unfolding, and Dr Avril Coghlan for helpful comments on the manuscript. This work was supported by a Sanger Institute Postdoctoral Fellowship to AMM. RD and the Wellcome Trust Sanger Institute are supported by the Wellcome Trust.