Phosphorylation is a universal mechanism for regulating cell behavior in eukaryotes. Although protein kinases are known to target short linear sequence motifs on their substrates, the rules for kinase substrate recognition are not completely understood. We used a rapid peptide screening approach to determine consensus phosphorylation site motifs targeted by 61 of the 122 kinases in Saccharomyces cerevisiae. Correlation of these motifs with kinase primary sequence has uncovered previously unappreciated rules for determining specificity within the kinase family, including a residue determining P−3 Arg specificity among members of the CMGC group of kinases. Furthermore, computational scanning of the yeast proteome enabled the prediction of thousands of new kinase-substrate relationships. We experimentally verified several candidate substrates of the Prk1 family of kinases in vitro and in vivo, and we identified a protein substrate of the kinase Vhs1. Together, these results elucidate how kinase catalytic domains recognize their phosphorylation targets and suggest general avenues for the identification of new kinase substrates across eukaryotes.
As one of the most widespread posttranslational modifications, protein phosphorylation is involved in virtually every basic cellular process, including DNA replication, gene transcription, protein translation, cell growth and metabolism, differentiation, and intercellular communication. With the advent of whole genome sequencing, the entire complement of kinases, or “kinome”, for multiple organisms has been cataloged, revealing that most eukaryotes devote ~2% of their protein coding capacity to these enzymes (1). Unraveling the function of each member of such a large family remains a challenge. Advances in phosphoproteomic methodologies, such as large-scale mass spectrometry (MS)-based phosphorylation site discovery, targeted siRNA screens, the use of analog-sensitive kinase alleles that are engineered to accept specific inhibitors and ATP analogs, and protein microarray analyses, have shed considerable light on the scope and complexity of phosphorylation-based signal transduction pathways in eukaryotes (2–5).
However, one aspect of protein kinase biology that remains poorly understood is how kinases achieve specificity for their target substrates. Understanding rules for substrate recognition by kinases has important applications in the mapping of phosphorylation sites in protein substrates, discovery of new substrates, and production of model substrates for small molecule inhibitor screening (6). In addition, a detailed understanding of how kinases interact with their substrates enables both deciphering and genetic re-wiring of kinase specificity, thereby uncovering fundamental ways in which signaling pathways are organized and propagated (7, 8).
In a typical eukaryotic cell, there are hundreds of thousands of Ser, Thr, and Tyr residues among the thousands of proteins. To ensure signaling fidelity, kinases must somehow discriminate among these vast numbers of potential phosphorylation sites. Mechanisms that influence substrate selection by a protein kinase include subcellular localization, substrate docking interactions, and binding to scaffold proteins (9). An important aspect of substrate recognition, however, is that the phosphorylation site on the substrate falls within a consensus amino acid sequence that is complementary to the active site of the kinase.
Consensus phosphorylation site motifs for protein kinases have been previously established on an individual basis through either the inspection of known phosphorylation sites, systematic mutagenesis of protein and peptide substrates, or screening of peptide libraries (10, 11). Although these studies have provided valuable insight into substrate recognition, such data are available for only a subset of known protein kinases. NetPhorest, which is the most comprehensive repository for kinase phosphorylation site motifs reported to date, includes motifs for only 35% of all human kinases (12). The incompleteness of the available data and the heterogeneity of the methods by which they were collected limit their application to elucidating cellular signaling pathways and modeling larger phosphorylation networks. For example, using motif scanning approaches to link specific kinases to the thousands of in vivo phosphorylation sites discovered through MS-based phosphoproteomics has proven difficult in targeted kinase studies because multiple kinases can potentially target the same or similar motifs.
We thus set out to catalog consensus phosphorylation site motifs for the kinome of the model organism Saccharomyces cerevisiae. We adapted a peptide library screening approach (13) to a miniaturized format that would enable rapid analysis of large numbers of kinases. With this method, we determined consensus phosphorylation motifs targeted by 61 of the 122 yeast kinases. This large collection of phosphorylation site motifs provided new insight into the structural basis for substrate recognition by protein kinases as a family in a manner not possible through analyses of individual kinases. Furthermore, we used our motif collection to predict new kinase-substrate relationships through database scanning and integration with other yeast proteomic and genomic datasets.
To determine phosphorylation motifs for yeast protein kinases, we developed a high-throughput approach using our previously reported positional scanning peptide library (13). This library consisted of 200 distinct peptide mixtures in which each 16-mer peptide contained a central fixed phosphorylation acceptor (phosphoacceptor) site (an equimolar mixture of Ser and Thr) flanked by degenerate positions consisting of equimolar mixtures of the 20 amino acids excluding Ser, Thr, and Cys, and a carboxy-terminal biotin tag (Fig. 1A). For each of the nine positions surrounding the phosphoacceptor site, there were 22 peptide mixtures in which each of the 20 unmodified amino acids, as well as phosphothreonine (pT) and phosphotyrosine (pY), were fixed. In addition to these 198 (9 × 22) peptide mixtures, two control peptide mixtures bearing either Ser or Thr alone as the fixed phosphoacceptor residue in the context of a fully degenerate sequence were also included. These control mixtures served as indicators of any preference the kinase had for either Ser or Thr residues at the phosphoacceptor site. Peptides were incubated with the kinase of interest in the presence of radiolabeled ATP. At the end of the incubation period, aliquots of each reaction were spotted simultaneously using a capillary pin-based liquid transfer device onto a streptavidin-coated membrane that captured the peptide substrates through their carboxy-terminal biotin tags. After extensive washing, the membrane was dried and exposed to a phosphor screen, allowing the extent of radiolabel incorporation for each peptide to be visualized and quantified. To enable high-throughput analysis, all steps were performed in a 1536-well format, thereby reducing the amount of kinase and peptide required and enabling simultaneous analysis of four kinases.
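The composition of the positional scanning library described above can be enumerated in a few lines. The following is a minimal sketch for illustration only; the position labels and the `build_library` helper are hypothetical names, not part of the original method.

```python
from itertools import product

# Nine positions flanking the phosphoacceptor (P0), as described in the text.
POSITIONS = ["P-5", "P-4", "P-3", "P-2", "P-1", "P+1", "P+2", "P+3", "P+4"]

# The 20 unmodified amino acids plus phosphothreonine (pT) and phosphotyrosine (pY).
FIXED_RESIDUES = list("ACDEFGHIKLMNPQRSTVWY") + ["pT", "pY"]


def build_library():
    """Return one (position, fixed_residue) label per peptide mixture."""
    mixtures = list(product(POSITIONS, FIXED_RESIDUES))
    # Two control mixtures: fully degenerate flanks with Ser or Thr alone at P0.
    mixtures += [("P0", "S"), ("P0", "T")]
    return mixtures


library = build_library()
print(len(library))  # 9 * 22 + 2 mixtures
```

Each label corresponds to one well (and one spot on the membrane), so the readout for a kinase is a 9 × 22 grid plus the two phosphoacceptor controls.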
Three yeast kinases (Tpk1, Tpk2, and Ste20) were assayed with both the miniaturized and large volume formats, and we performed multiple replicates with one of these kinases, Tpk1. Identical results were observed with the two formats and in replicate assays with the 1536-well format (data for Tpk1 is shown in Fig. S1). These kinases also recapitulated preferences of their mammalian orthologs for basic residues upstream of the phosphorylation site (13, 14). These results confirm that the miniaturized peptide library screening system is reproducible and provides data that is quantitatively equivalent to lower throughput approaches.
With our peptide array method, we screened 111 of the 122 yeast kinases. Kinases were initially purified from yeast strains that harbor galactose-inducible expression plasmids bearing either a C-terminal tandem affinity purification tag or an N-terminal glutathione S-transferase (GST) tag (15, 16). In a number of instances, it was necessary to perform the assay in the presence of known activating subunits [for example, cyclins for cyclin-dependent kinases (CDKs)], phosphorylate the kinase in vitro or co-express it with an activating kinase, or purify the kinase from yeast grown under activating conditions. For kinases with which poor yields were obtained from yeast, we employed alternative bacterial and mammalian cell expression systems. Each kinase was assayed on the peptide substrates in duplicate on separate days. In total, we generated reproducible phosphorylation motifs for 61 of the 111 yeast kinases screened (Fig. 1B and table S1). Three distinct motifs were generated for the cyclin-dependent kinase Pho85 by analyzing it separately in complex with different cyclin subunits (Pho80, Pcl1, and Pcl2). The remaining kinases were not sufficiently active to phosphorylate the peptides above background levels. These kinases may be highly specific for particular protein substrates and thus do not phosphorylate peptides efficiently. For example, in keeping with previous observations for their mammalian orthologs (17), we did not observe activity on our peptide substrates for the eight kinases in the mitogen-activated protein kinase kinase (MAPKK) and mitogen-activated protein kinase kinase kinase (MAPKKK) families. Other kinases were likely simply inactive under exponential growth conditions or when assayed in the absence of obligate binding partners and may be suitable for analysis once their activation mechanisms are more completely understood.
Approximately half of the phosphorylation site motifs that we determined for yeast kinases were identical to known motifs, as they corresponded to yeast homologs of mammalian kinases that have been previously characterized (11, 12). In contrast, the remaining kinases and their mammalian homologs have either not been previously characterized (table S2 lists mammalian homologs and indicates which kinases have previously known motifs) or in one instance (Tos3) yielded a different motif from that reported. Representative spot arrays produced by four kinases for which phosphorylation motifs were not previously known (Atg1, Gin4, Mps1, and Prk1) are shown in Fig. 1B. Spot intensities from the peptide arrays were quantified, background corrected, and normalized to provide the selectivity values shown in Table 1. We verified the consensus phosphorylation motifs for these kinases by performing kinase assays using optimized peptide substrates (named ATGtide, GINtide, MPStide, and PRKtide, respectively) consisting of those residues that were most highly selected at each position. As shown in Figure 1C, each kinase was highly specific for its corresponding peptide substrate, thus providing independent validation of our mixture based peptide library screening approach.
Notably, the autophagy-linked kinase Atg1 has an atypical motif exhibiting selections for hydrophobic residues at multiple positions. We verified this motif by making targeted substitutions to the ATGtide substrate. As anticipated, substituting a different favorable hydrophobic residue (Met) at the most selective position (P−3) had no significant effect on the rate of ATGtide phosphorylation. Moreover, substituting unfavorable charged residues at any of three most strongly selective positions dramatically reduced the reaction rate (Fig. 1D).
Normalized, background-corrected phosphorylation signals for each kinase were assembled into position weight matrices (PWMs), which are quantitative representations of the phosphorylation motif. We scored each position for its total selectivity, and a specificity heat map of all kinases and positions revealed the wide range of selectivity exhibited by kinases (Fig. 2). At one extreme, Yck1 and Cka1 (yeast casein kinase 1 and casein kinase 2 homologs) were highly sequence specific, with requirements for particular amino acids at multiple positions. At the other extreme, Cak1 and Rad53 were the least selective in that, although the extent of substrate phosphorylation by these kinases is clearly dependent on peptide sequence, there were no residues that were absolutely required at any position surrounding the phosphoacceptor. Most kinases fell between these extremes, with a combination of required residues and more subtle propensities that influence the overall efficiency of phosphorylation. Furthermore, although at least several kinases were highly selective at each position surrounding the phosphorylation site, kinases were most frequently selective at the P−3 position, followed by the P−2 and P+1 positions. By contrast, few kinases were selective at the P−1 position.
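The text does not specify how total selectivity at a position was scored. One plausible choice, shown here purely as an assumption, is the relative entropy (in bits) of the normalized weights against a uniform distribution: it is zero when all residues are equally favored and maximal when a single residue is required. The function names and the toy PWM are hypothetical.

```python
import math


def position_selectivity(weights):
    """Relative entropy (bits) of one PWM row versus a uniform distribution.

    ASSUMPTION: this scoring rule is illustrative; the source does not
    state which selectivity measure was used.
    """
    total = sum(weights)
    m = len(weights)
    score = 0.0
    for w in weights:
        p = w / total
        if p > 0:  # zero-weight residues contribute nothing
            score += p * math.log2(p * m)
    return score


def selectivity_profile(pwm):
    """Map each position label to its selectivity score."""
    return {pos: position_selectivity(list(row.values())) for pos, row in pwm.items()}


# Toy PWM over four residues only (hypothetical values).
toy_pwm = {
    "P-3": {"R": 3.0, "K": 1.0, "A": 0.0, "D": 0.0},  # strongly selective
    "P-1": {"R": 1.0, "K": 1.0, "A": 1.0, "D": 1.0},  # non-selective
}
profile = selectivity_profile(toy_pwm)
```

Stacking such profiles for all kinases gives a kinase-by-position matrix of the kind that a specificity heat map displays.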
The 61 yeast kinases were clustered into groups on the basis of phosphorylation site selectivity (Fig. 3). Thirty-five kinases were observed to target basophilic motifs. Thirty-one of these showed a classic “basophilic” signature (10), with a strong selectivity primarily for an Arg residue at the P−3 position. This was the single most common feature found among all motifs (Fig. 3, table S1). Four other basophilic kinases, Ipl1, Skm1, Ste20, and Cla4, were selective for Arg at the P−2 position, but did not show strong selectivity for Arg at the P−3 position (Fig. 3 and table S1). The basophilic kinases, however, diverged with respect to the residues selected at other positions. For example, basophilic kinases are often reported to be selective primarily for either Leu or Arg at the P−5 position, as well as selective for Arg at P−3 (13, 18–20). Among the various kinases that selected Arg at the P−3 position, we observed a spectrum of residues selected at the P−5 position, including Leu (Cmk1 and Cmk2) and Arg (Ypk1), but also Met (Vhs1), Val or Ile (Prr1), and His (Psk2) (Fig. 3 and table S1). The seven proline-directed kinases, which primarily selected for Pro at the P+1 position, were also distinguishable on the basis of selectivity at other positions. For example, Kss1, Hog1, and Fus3 all showed a secondary selectivity for proline at the P−2 position that was not observed for Pho85 or Cdc28. Other motifs were less common, and include multiple distinct “acidophilic” motifs in which the strongest selectivity was for Asp, Glu, or pThr. Such acidophilic motifs have been previously seen for various mammalian kinases, including GSK3 (selectivity for acidic amino acids at the P+4 position), CK1 (P−5 through P−3), PLK (P−2), and CK2 (P+1 through P+3) (21–23). All yeast orthologs of these kinases recapitulated the motif found in their mammalian orthologs (table S2), but we also found additional yeast acidophilic kinases that were not anticipated (Mps1, Gcn2, and Cdc7).
In addition, three kinases, Atg1, Kin1, and Kin3, exhibited their strongest selectivities for hydrophobic residues. The remaining kinases exhibited multiple strong selectivities and could not easily be categorized.
Yeast kinases have been classified into five groups on the basis of sequence homology: AGC (PKA/PKG/PKC), CAMK (calcium/calmodulin regulated and structurally similar kinases), CMGC (CDKs, MAPK, GSK, and CDK-like kinases), STE11/STE20, and STE7/MEK (MAPKK) (24). These groups have then been classified further into families that share a high degree of sequence similarity within their catalytic domains. Although related kinases generally recognized similar phosphorylation motifs, kinases within the same family occasionally exhibited differences, both subtle and striking. One family that illustrates striking differences is the Snf1 kinase family, which belongs to the CAMK group. In yeast, the Snf1 [also known as the AMPK (AMP-activated protein kinase)] family has six family members — Gin4, Hsl1, Kcc4, Kin1, Kin2, and Snf1. We identified consensus phosphorylation site motifs for each of these kinases with the exception of Kin2 (Table 1 and table S1). All five kinases had common features in their motifs, which are also shared with mammalian AMPKs (25, 26). For example, each one had preferences for a Ser residue as the phosphoacceptor site, a Ser residue at the P−2 position, an Asn residue at the P+3 position, and hydrophobic residues at the P+4 position (Gin4, Snf1, and Kin1 are summarized in Table 1; see Dataset S1 for quantitative data for Hsl1 and Kcc4). Strikingly, however, only four of the five Snf1 family kinases exhibited the hallmark basophilic P−3 Arg selectivity of the CAMK group, with Kin1 lacking this conserved feature. Instead, Kin1 had an additional preference for an Asn residue at the P−2 position. This difference correlated with a single amino acid substitution within the kinase catalytic domain (Fig. 4A). Gin4, Hsl1, Kcc4, and Snf1 each have a conserved Glu residue (corresponding to Glu127 in PKA, Fig. 4B). 
Crystal structures of multiple basophilic kinases in complex with peptide substrates have shown that this residue forms a salt bridge with the guanidino group of the P−3 Arg residue of the bound substrate (27–30). Unlike the other family members, Kin1 has a Gln residue in place of this conserved Glu. These observations are thus consistent with a role for Glu127 as the critical specificity-determining residue for Arg at the P−3 position in substrates, at least within the Snf1 family.
However, crystallographic insight into specificity determinants in protein kinases is limited to a handful of cases where structures have been solved of kinase-peptide complexes. Although computational approaches have offered additional insight into structural features that control specificity (31, 32), the existence of alternative binding modes, even between kinases with similar specificity (30), makes it difficult to make general conclusions regarding the relationship of kinase sequence to specificity. Indeed, multiple sequence alignment of the yeast kinome and comparison with our experimentally determined motifs indicated that the presence of an acidic residue at position 127 is neither necessary nor sufficient to direct selectivity for Arg at the P−3 position in substrates. For example, within the CMGC group, members of the MAPK and CDK families (Fus3, Kss1, Hog1, Cdc28, and Pho85), which are proline-directed kinases, have an Asp residue at that position, despite a lack of selectivity for Arg at the P−3 position. Conversely, Yak1 within the same group is basophilic, yet lacks an acidic residue at that position (Table 1 and Fig. 4A). Presumably, other residues within the catalytic domain are responsible for dictating a basophilic signature within this group of kinases.
With our large collection of kinase motifs, we identified previously unknown specificity-determining residues, including, but not restricted to, residues that might confer P−3 Arg selectivity for kinases that are not part of the Snf1 family. We used an approach based on the idea of co-variation (33). We identified residues whose variation in the primary sequence of the catalytic domain significantly correlated with the variation in phosphorylation site specificity across kinases. To measure sequence variation, we used a simple pairwise similarity matrix, and to compare specificities, we calculated the Frobenius norm of the differences in PWMs (Table 2 and Fig. 4B). This approach reproduced several specificity-determining residues previously known from both structural and mutagenesis studies, including Glu127. In addition, we uncovered many previously unknown candidate specificity-determining residues, seven of which were predicted to be within 10 Å of a bound protein substrate. Among these, an acidic Glu residue at position 170 (PKA numbering) correlated with P−3 Arg selectivity among CMGC kinases. This result contrasts with a previous prediction based on modeling of DYRK1A, the human homolog of Yak1 (34). To test our predictions, we examined the role of residue 170 in substrate selection. Indeed, a Ser to Glu mutation at the analogous position in the MAPK Kss1 (residue 147) conferred a basophilic signature (Fig. 4C and Fig. S2). This result validates our ability to predict new specificity-determining residues on the basis of our large motif dataset.
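The co-variation idea can be sketched for a single alignment column as follows. This is a simplified illustration under stated assumptions, not the analysis actually performed: residue identity (1 if two kinases share the residue, 0 otherwise) stands in for the pairwise similarity matrix, a Pearson correlation stands in for the significance test, and the kinase names and PWM values are invented.

```python
import math
from itertools import combinations


def frobenius_distance(pwm_a, pwm_b):
    """Frobenius norm of the element-wise difference of two equal-shaped PWMs."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(pwm_a, pwm_b)
                         for a, b in zip(row_a, row_b)))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def column_covariation(column_residues, pwms):
    """Correlate residue identity at one column with PWM distance over all kinase pairs.

    ASSUMPTION: identity is used as the similarity measure for illustration.
    A strongly negative value means kinases sharing the residue have similar motifs.
    """
    sims, dists = [], []
    for k1, k2 in combinations(sorted(pwms), 2):
        sims.append(1.0 if column_residues[k1] == column_residues[k2] else 0.0)
        dists.append(frobenius_distance(pwms[k1], pwms[k2]))
    return pearson(sims, dists)


# Hypothetical toy data: two kinases with Glu and an Arg-selective motif,
# two with Gln and a flat motif (1 x 2 PWMs for brevity).
residues = {"KinA": "E", "KinB": "E", "KinC": "Q", "KinD": "Q"}
pwms = {"KinA": [[3.0, 1.0]], "KinB": [[3.0, 1.0]],
        "KinC": [[1.0, 1.0]], "KinD": [[1.0, 1.0]]}
r = column_covariation(residues, pwms)
```

Repeating this calculation for every alignment column and ranking the columns by correlation would yield a list of candidate specificity-determining positions.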
Because in vivo phosphorylation sites on protein substrates tend to fall within the context of the phosphorylation site motif for a particular kinase, database scanning has been used to predict new substrates and to pinpoint sites of phosphorylation (14, 26, 35–39). However, simple sequence matching approaches are prone to false positives, because predicted sites may not be accessible for phosphorylation, and kinases can also depend on docking or scaffolding interactions for substrate recruitment. In addition, false negatives are frequent for kinases with low sequence specificity because their motifs occur in many proteins and are thus present with high frequency in databases (14, 18). To increase the accuracy of such predictions, we generated and used a motif analysis pipeline, MOTIPS (http://motips.gersteinlab.org/). MOTIPS scans sequence databases for sites that most closely match the PWM for a particular kinase using a modified algorithm based on the program Scansite (40). Predicted sites are then scored on the basis of a panel of features (evolutionary conservation, predicted surface accessibility, and disordered structure) that are characteristic of known phosphorylation sites (41–43).
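A generic PWM scan of a protein sequence can be sketched as below. This is not the Scansite or MOTIPS algorithm; it is a common log-odds scoring scheme shown under the assumption that PWM weights are positive and normalized so that 1.0 is neutral. The function names, the toy PWM, and the toy sequence are hypothetical.

```python
import math


def score_site(seq, idx, pwm):
    """Sum of log2 weights over the flanking positions of the site at seq[idx].

    pwm maps an offset (e.g. -3 for P-3) to {residue: weight}. Residues not
    listed are treated as neutral (weight 1.0); offsets that fall outside the
    sequence are skipped. ASSUMPTION: all listed weights are positive.
    """
    score = 0.0
    for offset, weights in pwm.items():
        j = idx + offset
        if 0 <= j < len(seq):
            score += math.log2(weights.get(seq[j], 1.0))
    return score


def scan(seq, pwm, acceptors="ST"):
    """Return (score, 1-based position, residue) for each Ser/Thr, best first."""
    hits = [(score_site(seq, i, pwm), i + 1, aa)
            for i, aa in enumerate(seq) if aa in acceptors]
    return sorted(hits, reverse=True)


# Hypothetical motif: Arg favored and Asp disfavored at P-3, Pro disfavored at P+1.
toy_pwm = {-3: {"R": 8.0, "D": 0.25}, 1: {"P": 0.5}}
hits = scan("AARAASAADAATPA", toy_pwm)
```

In a pipeline of the kind described, the top-ranked sites from such a scan would then be annotated with conservation, accessibility, and disorder features before final scoring.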
We first analyzed established kinase substrates for the presence of their respective phosphorylation site motifs with MOTIPS. From a sampling of 174 in vivo kinase-substrate relationships curated from the literature, 99 of the substrates ranked among the top 0.5% of predicted sites for their respective kinase, with 27 substrates falling within the top 200 sites (Fig. 5A). We next analyzed predicted substrates for each of the 61 yeast kinases for their associated biological processes and respective localization according to Gene Ontology (GO) assignments in the Saccharomyces Genome Database (44) (Fig. 5B; the full list of predicted substrates for each kinase with associated GO terms and MOTIPS features is provided as Dataset S2). We found that predicted substrates were more likely to be associated with the same biological process and to localize to the same subcellular compartment as their respective kinases than a randomly chosen set of proteins. Taken together, these observations suggest that motif scanning using our set of phosphorylation site motifs enriches for authentic kinase-substrate pairs.
To establish directly that our bioinformatics analysis had uncovered authentic substrates, we examined more closely the predicted substrates of the protein kinase Prk1. Prk1 is a member of a small family of kinases conserved throughout eukaryotes that mediates reorganization of the actin cytoskeleton during endocytosis (45). Our peptide array analysis revealed an unusual phosphorylation site motif that included strong preferences for aliphatic residues at the P−5 position, Gly at the P+1 position, and Thr as the phosphoacceptor (Fig. 1B, Table 1). We selected 107 Prk1 candidate substrates identified by MOTIPS for further analysis. These substrates contained sites of high, middle, and low rank among the top 2,000 scoring sites. Because all five known Prk1 substrates undergo multisite phosphorylation (45–47), candidates were also chosen for having at least three predicted Prk1 phosphorylation sites. Of the 107 candidate substrates, we observed phosphorylation of 19 candidates in vitro with wild-type Prk1 but not with a Prk1 inactive mutant (Fig. S3). To identify additional candidates, we used these 19 candidates as positive data points in a training set to educate MOTIPS by machine learning. Negative data points in the training set included 81 of the original Prk1 candidates that were unambiguously not substrates in vitro, as well as about 400 proteins identified in the yeast protein database as localizing solely to non-cytosolic compartments (48).
This set of positive and negative data points was used to re-train the Bayesian algorithm in MOTIPS to integrate the motif matching, conservation, surface accessibility, and disorder scores for each site, along with an additional score based on the number of predicted sites. The five known in vivo substrates of Prk1, which were excluded from the training set, all fell within the top seven targets (Fig. 6A). Five additional candidates taken from the top 15 putative substrates in the new Prk1 hit list were tested by an in vitro kinase assay that used the purified candidates as substrates. These in vitro assays revealed three additional new substrates for Prk1: Gon7, a protein component of the EKC/KEOPS (Endopeptidase-like Kinase Chromatin-associated/Kinase, putative Endopeptidase and Other Proteins of Small size) complex involved in telomere regulation; Gph1, a protein involved in the mobilization of glycogen; and the key endocytic protein Las17. One of the five additional candidates tested was Ypl150w, which is a putative kinase that autophosphorylated in our assay and thus could not be confirmed or excluded as a substrate of Prk1. This second round of in vitro assays provides additional evidence that retraining our algorithm increased our success rate in predicting authentic kinase substrates. Furthermore, among the 22 in vitro confirmed Prk1 substrates, seven proteins (Bem2, Ede1, Las17, Sac3, Sla2, Syp1, and Yap1801) are reported to have roles in endocytosis or the regulation of the actin cytoskeleton, suggesting that they may be subject to regulation by Prk1 (Table 3).
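The details of the Bayesian integration in MOTIPS are not given here. As a rough illustration of how several per-site features can be combined after training on confirmed and rejected candidates, the following sketch uses a naive Bayes classifier over binarized features with Laplace smoothing. The feature encoding, the smoothing constant, and the training rows are all assumptions made for this example.

```python
import math


def train(positives, negatives, alpha=1.0):
    """Fit per-feature Bernoulli rates for each class with Laplace smoothing.

    Each row is a tuple of 0/1 flags, e.g. (strong motif match, conserved,
    multiple predicted sites). ASSUMPTION: features are binarized and
    treated as independent, which the source does not state.
    """
    n_feat = len(positives[0])

    def rates(rows):
        return [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_feat)]

    return {"prior": math.log(len(positives) / len(negatives)),
            "pos": rates(positives), "neg": rates(negatives)}


def log_odds(model, x):
    """Posterior log-odds that a candidate with features x is a substrate."""
    score = model["prior"]
    for xj, p, q in zip(x, model["pos"], model["neg"]):
        score += math.log(p / q) if xj else math.log((1 - p) / (1 - q))
    return score


# Hypothetical training rows (not data from the study).
positives = [(1, 1, 1), (1, 1, 0), (1, 0, 1)]
negatives = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
model = train(positives, negatives)
```

Ranking all candidates by `log_odds` gives a re-ordered hit list; held-out known substrates, as in the test described above, provide a check on whether retraining improved the ranking.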
We next investigated whether our predicted Prk1 candidate substrates represented bona fide substrates. Because a closely related kinase, Ark1, has an overlapping biological function and shares a nearly identical phosphorylation site motif with Prk1, we examined the phosphorylation state of candidate substrates in yeast strains deleted for both PRK1 and ARK1. Changes in phosphorylation were monitored by electrophoretic mobility shifts in immunoblots of purified substrates, with phosphatase-treated samples serving as a control for the unphosphorylated species. We observed a change in mobility for two candidate substrates, Bem2 and Ede1, suggesting that they are in vivo targets of Prk1 or Ark1, or both (Fig. 6B). Although we did not observe gel shifts for other substrates, it is likely that some are authentic Prk1/Ark1 substrates as well but simply do not change mobility upon phosphorylation. Notably, previous mass spectrometry (MS) phosphoproteomic analysis identified three of the in vitro Prk1 substrates (Ede1, Syp1, and Rpl5) as phosphorylated at Prk1 consensus sites in vivo (49–54) (the MOTIPS output for all kinases, which is available as Dataset S2, indicates which candidate phosphorylation sites have been identified by MS).
We also validated kinase-substrate pairs through integration with other proteomic datasets. We found that the kinase Vhs1, for which limited functional information is known, exhibited selectivity for the phosphorylation site motif MXRXXS (Table 1 and table S1). Fourteen in vitro substrates for the kinase Vhs1 (55) were previously identified by protein microarray analysis (4), and six of these, Mga1, Pfk26, Sef1, Sol1, Sol2, and Utr1, contain the Vhs1 consensus phosphorylation site motif. MS phosphoproteomic analysis (49) revealed that Sef1 was phosphorylated in vivo at a Vhs1 consensus phosphorylation site, and in an immunoprecipitation-MS analysis Sef1 and Vhs1 physically interacted (56). In addition, MS phosphoproteomic analysis identified Sol1 as phosphorylated at a Vhs1 consensus phosphorylation site in vivo (50), and its homolog Sol2 was the most highly phosphorylated Vhs1 in vitro substrate identified by protein microarray analysis (4). Mobility shift analysis of VHS1 deletion strains using Phos-tag SDS-PAGE (57) was consistent with Sol2 as a substrate for Vhs1 in vivo (Fig. 6C). Though the presence of multiple Sol2 species in the presence and absence of Vhs1 indicates phosphorylation at multiple sites, likely by more than one kinase, the mobility shift indicates that in vhs1 mutant cells, Sol2 is phosphorylated at fewer sites. Sol2, which promotes nucleocytoplasmic tRNA transport (58), is the first reported in vivo substrate for Vhs1 and suggests a role for this kinase in regulating this process. These results illustrate how integration of data from multiple proteomic approaches can shed light on the biology of poorly characterized molecules.
The elucidation of the mechanisms underlying kinase specificity remains an integral part of understanding phosphorylation-based signal transduction pathways. Previous methods for determining consensus phosphorylation site motifs have not been suitable for large-scale screening of a eukaryotic kinome. Here, we have described an approach for the high-throughput identification of consensus phosphorylation site motifs in which multiple kinases, with no previously known substrates, can be analyzed simultaneously. We have used this approach to provide comprehensive analysis of kinase specificity in a single eukaryotic organism, the yeast Saccharomyces cerevisiae. Among other applications, this large dataset has provided much broader insight into the structural basis for kinase selectivity than has been possible through individual analyses of single kinases.
With our data, we linked protein kinases to previously unknown substrates, thus elucidating mechanisms of phosphorylation-dependent signaling. A limitation to our approach, however, is that the peptide arrays treat each position in the substrate independently, and thus the potential interdependence between multiple positions is ignored. This approach is nonetheless a valuable first pass screen for analyzing kinase specificity because it involves the systematic and exhaustive analysis of each amino acid residue at each position surrounding the phosphorylation site. Preferences observed with this approach can provide the basis for the design of kinase-specific peptide libraries to uncover positional interdependence. Furthermore, the presence of a consensus phosphorylation sequence alone is insufficient to direct phosphorylation of a protein by a particular kinase, and accordingly identification of previously unknown substrates on the basis of motif scanning is difficult. However, integration with other proteomic datasets provides a means of increasing confidence in predicted kinase-substrate relationships. In addition, specific kinase-substrate pairs can be inferred through computational methods that make use of non-sequence-based “contextual” features, such as subcellular localization and molecular function (38). For example, predicting substrates targeted by relatively nonspecific kinases using phosphorylation site motifs alone is unlikely to be successful because these sequences occur frequently in proteomes. In such cases, selection of authentic substrates is driven by docking or scaffolding interactions, and consensus sequences for substrate recruitment can be used in combination with phosphorylation site motifs to identify new substrates (59, 60).
For previously characterized kinases, we observed a high degree of conservation of phosphorylation site motifs between yeast and mammalian orthologs. These similarities suggest that the many previously unknown consensus motifs reported here are also conserved. Therefore, this dataset will serve as a resource for studies of phosphorylation-dependent signaling in higher eukaryotes, as well as yeast.
Details regarding yeast strain information, kinase preparation, characterization of purified kinases, in vitro kinase assays, and electrophoretic mobility shift analyses are available in the Supplementary Material.
The peptide library (Anaspec, Inc.) has been previously reported (13). For this study, fresh stock solutions were made from 5 mg powder by dissolving peptides in DMSO, quantifying by absorbance at 280 nm, and adjusting to a stock concentration of 10 mM by adding the appropriate volume of DMSO. Stock solutions were stored at −20°C in microcentrifuge tubes. Working 0.6 mM aqueous stocks were prepared by diluting the DMSO stock in 20 mM HEPES, pH 7.4 and arrayed into 1536-well stock plates containing 5 μl aliquots in each well. Plates were sealed with adhesive foil and stored at −20°C.
Peptides (0.2 μl per well) were transferred to assay plates containing 2 μl of kinase reaction buffer (generally 20 mM HEPES, pH 7.4, 10 mM MgCl2, 1 mM DTT, 0.1% Tween 20) from stock plates manually using a 48 × 6 slot pin replicator (VP Scientific). Reactions were initiated by adding a solution (0.2 μl per well) containing purified kinase and γ-[33P]-ATP (0.55 mM, 0.3–0.4 μCi/μl, Perkin Elmer) using a 48 × 1 slot pin replicator (VP Scientific). Plates were sealed and incubated for 1 to 8 hr at 30°C. The final concentrations of the reaction components in each well were 50 μM peptide and 50 μM ATP at a specific activity of 0.55–0.73 mCi/μmol. After incubation, 0.2 μl from each well was spotted onto streptavidin-coated membrane (SAM2 Biotin Capture Membrane, Promega) simultaneously using the 48 × 6 slot pin replicator. Membranes were washed three times with 10 mM Tris-HCl, pH 7.5 with 140 mM NaCl and 0.1% SDS, twice with 2 M NaCl, twice with 2 M NaCl with 1% H3PO4, and twice with water, then dried and exposed to a phosphor storage screen. Processing of final images of the spot arrays consisted of copying the 4 × 22 grid corresponding to the P+1, P+2, P+3, and P+4 peptide mixtures and pasting it below the 5 × 22 grid corresponding to the P−5, P−4, P−3, P−2, and P−1 peptide mixtures using Adobe Photoshop to provide the 9 × 22 spot grids shown in Figure 1 and table S1.
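The stated final concentrations and specific activity can be checked with simple dilution arithmetic. The sketch below is only a worked check: it assumes that the 0.55 mM figure is the ATP concentration of the 0.2 μl kinase/ATP addition and that the final well volume is the sum of the three volumes given above. Function names are illustrative.

```python
# Volumes per well in microlitres, taken from the protocol text.
BUFFER_UL = 2.0
PEPTIDE_UL = 0.2
KINASE_ATP_UL = 0.2


def final_concentration_um(stock_mm, added_ul, total_ul):
    """Micromolar concentration after diluting `added_ul` of a stock (mM) into `total_ul`."""
    return stock_mm * 1000.0 * added_ul / total_ul


def specific_activity_mci_per_umol(uci_per_ul, atp_mm):
    """Specific activity of the ATP solution.

    1 mM equals 1 nmol/ul, so uCi/ul divided by mM gives uCi/nmol,
    which is numerically the same as mCi/umol.
    """
    return uci_per_ul / atp_mm


total_ul = BUFFER_UL + PEPTIDE_UL + KINASE_ATP_UL  # assumed final volume: 2.4 ul
peptide_um = final_concentration_um(0.6, PEPTIDE_UL, total_ul)
atp_um = final_concentration_um(0.55, KINASE_ATP_UL, total_ul)
activity_low = specific_activity_mci_per_umol(0.3, 0.55)
activity_high = specific_activity_mci_per_umol(0.4, 0.55)

print(f"peptide: {peptide_um:.1f} uM, ATP: {atp_um:.1f} uM")
print(f"specific activity: {activity_low:.2f} to {activity_high:.2f} mCi/umol")
```

Under these assumptions the peptide comes out at 50 μM and the ATP at roughly 46 μM, which is consistent with the rounded 50 μM quoted in the text, and the specific activity range matches the stated 0.55–0.73 mCi/μmol.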
For each array, peptide phosphorylation signals were quantified using Genepix Pro 6.0 (Molecular Devices) by manually aligning a 48 × 8 grid of circles onto each scanned phosphorimage to calculate the median intensity for each spot. These median intensity values were then background corrected by subtracting the median intensity value corresponding to the negative control spot (reaction carried out in the absence of any peptide substrate). Signal scores for each amino acid at each position were then normalized by the following equation
where Z_ca stands for the normalized score of amino acid a at position c having a signal score S_ca, and m stands for the total number of amino acids. S_ci is the signal score of amino acid i at position c, where i runs over all m amino acids in the summation. The PWM is an N × 20 matrix of N positions with the normalized, background-corrected value given as the weight for each amino acid at each position. To account for spurious phosphorylation of Ser and Thr residues at other positions, the PWM entries in all Ser and Thr positions were set to one (equivalent to neutral selection at that position) with subsequent renormalization of the PWM.
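The normalization equation itself is not reproduced in this text. A form consistent with the symbol definitions above, and with the statement that a value of one corresponds to neutral selection, is Z_ca = m · S_ca / Σ_i S_ci, which makes the mean weight at each position equal to one. The sketch below assumes that form. It also assumes that "renormalization" means rescaling the whole row again after the Ser and Thr entries are set to one. Function names are illustrative and this is not the authors' code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalize_position(signals):
    """Scale one position so that the mean weight over its amino acids is 1.

    `signals` maps amino acid -> background-corrected signal at this position.
    Assumed form: Z_ca = m * S_ca / sum_i(S_ci).
    """
    m = len(signals)
    total = sum(signals.values())
    if total <= 0:
        raise ValueError("position has no positive signal to normalize")
    return {aa: m * s / total for aa, s in signals.items()}


def build_pwm(raw_rows, neutralize=("S", "T")):
    """Build a PWM (list of per-position dicts) from background-corrected signals.

    Ser/Thr entries are set to 1 (neutral) and each row is then rescaled,
    which is one possible reading of "subsequent renormalization".
    """
    pwm = []
    for signals in raw_rows:
        row = normalize_position(signals)
        for aa in neutralize:
            if aa in row:
                row[aa] = 1.0
        pwm.append(normalize_position(row))
    return pwm


# Toy position with a strong Arg preference (made-up numbers).
toy = {aa: 1.0 for aa in AMINO_ACIDS}
toy["R"] = 40.0
print(round(build_pwm([toy])[0]["R"], 2))
```

Note that after the second rescaling the Ser and Thr weights are close to, but not exactly, one. Whether the original procedure held them at exactly one is not stated in the text.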
The entire yeast proteome was scanned to identify the best matches to each PWM. Our approach used a window-sliding method based on the normalized PWM similar to the method used in Scansite (40). Briefly, it extracted every possible 15-mer sequence from the yeast proteome and calculated the match score to the PWM, based on the formula:
where i stands for the position in the motif and r_i stands for the residue present at position i in the peptide in question. M_ia is the normalized PWM as described above. The resulting score was then normalized such that zero stands for an optimal match to the motif and larger positive scores correspond to weaker matches. The top 10,000 potential phosphorylation sites for each kinase are reported in Dataset S2. This algorithm was implemented in modular form in Java. All sequences and features were loaded into a SQL database that is interactively queried by the Java search module.
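The window-sliding scan can be illustrated with a short sketch. The scoring formula is not reproduced in this text, so the version below makes its own assumptions: the raw score is the sum of log PWM weights over the window, and the reported score is the optimal raw score minus the observed one, so that zero is a perfect match and larger values are weaker. The `floor` parameter, the function name, and the generic window width (rather than a fixed 15-mer) are all illustrative. The original implementation was in Java.

```python
import math


def scan_protein(sequence, pwm, center, floor=1e-3):
    """Slide a PWM along `sequence` and score every window centered on Ser/Thr.

    pwm    : list of dicts (amino acid -> weight), one per motif position
    center : index of the phosphoacceptor within the motif
    floor  : assumed lower bound on weights, to avoid log(0) and to score
             residues that are missing from a PWM row

    Returns (site position, 1-based; window; score) tuples, best match first.
    Sites too close to either terminus for a full window are skipped.
    """
    width = len(pwm)
    best = sum(math.log(max(max(row.values()), floor)) for row in pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if window[center] not in "ST":
            continue
        raw = sum(
            math.log(max(row.get(res, floor), floor))
            for row, res in zip(pwm, window)
        )
        hits.append((start + center + 1, window, best - raw))
    return sorted(hits, key=lambda hit: hit[2])


# Toy three-position motif: Arg at P-1, Ser/Thr at P0, Pro at P+1 (made-up weights).
toy_pwm = [{"R": 4.0, "A": 1.0}, {"S": 1.0, "T": 1.0}, {"P": 2.0, "A": 1.0}]
for site, window, score in scan_protein("ARSPAASA", toy_pwm, center=1):
    print(site, window, round(score, 3))
```

In a proteome-wide scan the same function would be applied to every protein and the hits pooled before ranking, which is how a top-N list per kinase could be produced.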
A number of different genomic features were gathered to supplement the initial match score. To compute the conservation score, we collected all orthologs for 13 proteomes of related yeast species (Saccharomyces paradoxus as the closest and Schizosaccharomyces pombe as the farthest) using the comparative genomics algorithm implemented in INPARANOID (61). We then aligned these orthologs using the automated alignment method MUSCLE (62) (the full set of alignments is available as Dataset S3). For each PWM hit, we calculated the conservation score by estimating the entropy at each position based on the aligned orthologs with the AL2CO program. The disorder score was based on the prediction program DISOPRED (63). DISOPRED was run for each protein in the yeast proteome. We used the DISOPRED probability score, corresponding to the likelihood of the residue in question being in a disordered region, as the measure of disorder. Finally, the surface accessibility score was calculated using the prediction program SABLE for each protein in the yeast proteome (64). The simple numerical surface score was used as the measure of surface accessibility.
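As an illustration of an entropy-based conservation measure, the sketch below computes the Shannon entropy of each alignment column and averages it over the columns covered by a motif hit. This is not the AL2CO implementation, and the averaging over the window and the treatment of gaps are assumptions made for the example. Lower entropy means higher conservation.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters.

    Returns None if the column contains only gaps.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return None
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def window_entropy(aligned_seqs, start, width):
    """Mean column entropy over `width` alignment columns beginning at `start`.

    `aligned_seqs` is a list of equal-length aligned ortholog sequences.
    All-gap columns are skipped. Returns None if no column could be scored.
    """
    values = []
    for col in range(start, start + width):
        h = column_entropy("".join(seq[col] for seq in aligned_seqs))
        if h is not None:
            values.append(h)
    return sum(values) / len(values) if values else None


# Toy alignment of three orthologous segments (made-up sequences).
print(round(window_entropy(["RSP", "RSP", "KTP"], 0, 3), 4))
```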
An integration algorithm based on the Naïve Bayes framework was used to integrate the four features. We used a number of experimentally determined gold-standard kinase substrate pairs, “positives,” to train the algorithm. For gold-standard negatives, we supplemented a set of experimentally determined negatives with a set of randomly chosen protein pairs. Each of these pairs is a pair of proteins that are annotated to always localize to two different compartments (for example, nucleus only and cytoplasm only). Thus, we biased the randomly chosen set of protein pairs further towards a set that was highly unlikely to contain any spurious positive interactions. The conditional probability was calculated from the four features according to the following formula:
where I denotes either interaction or non-interaction and D1 through D4 denote the four features. Data were thus integrated under the assumption that the four features are independent. To formally assess independence of the features, we calculated pairwise correlation coefficients. The absolute values of these coefficients ranged from 0.01 to 0.57, with an average of 0.18, indicating that the features are to a large extent independent (see table S3). Moreover, we performed Principal Component Analysis (PCA) using the statistical software R to transform the possibly correlated values of the five features (hits per protein, match score, disorder score, accessibility score, and conservation score) of the PRK1 targets into uncorrelated values. The first three principal components were used to build a Naïve Bayes model, followed by 10-fold stratified cross-validation. The area under the curve (AUC; 75.9%) of the receiver operating characteristic (ROC) curve resulting from the PCA validation was then compared to the AUC (78.6%) for PRK1 without the PCA transformation. The close agreement between the two further indicated that the features are largely independent. Bayesian integration was implemented using the Java machine learning package Weka (65). The entire methodology is available as the modularized software package MOTIPS (URL: http://motips.gersteinlab.org/).
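The conditional probability formula is not reproduced in this text. The standard Naïve Bayes form that matches the definitions above is P(I | D1, …, D4) ∝ P(I) · Π_j P(Dj | I), normalized over the two values of I. The sketch below applies that form with Gaussian likelihoods for each numeric feature, which is an assumption made for the example. The original analysis used Weka, and the toy feature values here are invented.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class.

    rows   : list of feature tuples (for example, four scores per candidate)
    labels : list of class labels (True = interaction, False = non-interaction)
    """
    model = {}
    for cls in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Variance floor (assumed) keeps the density finite for constant features.
        variances = [
            max(sum((x - mu) ** 2 for x in col) / n, 1e-6)
            for col, mu in zip(columns, means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model


def posterior(model, features, cls=True):
    """P(cls | features), treating the features as independent given the class."""
    log_joint = {}
    for c, (prior, means, variances) in model.items():
        total = math.log(prior)
        for x, mu, var in zip(features, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        log_joint[c] = total
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[cls] - top) / norm


# Invented training data: two positives and two negatives, four features each.
rows = [(1.0, 0.9, 1.1, 1.0), (0.9, 1.1, 1.0, 0.9),
        (0.1, 0.0, 0.2, 0.1), (0.0, 0.1, 0.1, 0.0)]
labels = [True, True, False, False]
model = fit_gaussian_nb(rows, labels)
print(posterior(model, (1.0, 1.0, 1.0, 1.0)) > 0.5)
```

The product over features is exactly where the independence assumption enters, which is why the pairwise correlation and PCA checks described above matter for interpreting the output.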
Sequences of the 61 yeast kinase catalytic domains (obtained from the kinase.com database) were initially aligned using ClustalW2 (66). A high-quality sequence alignment was generated by manually editing the initial alignment in Jalview (67) on the basis of multiple pairwise alignments with kinases of known 3D structure and conserved catalytic residues (table S4). In addition, 89 orthologous kinases from S. pombe, D. discoideum, and H. sapiens were added and manually aligned. For each of these orthologs, the PWM was inferred to be identical to that of its yeast counterpart. A correlation-based methodology was implemented to identify specificity-determining residues:
For each pair of a sequence position (n) and a position in the PWM (m), two k(k − 1)/2-dimensional vectors were generated, where k is the total number of kinases in the alignment and is equal to the number of PWMs. The first vector contained all pairwise similarities between the primary sequences of the kinases at that position, based on the McLachlan matrix (that is, the similarity of the amino acid at position X in kinase A to the amino acid at the same position in kinase B) (68). The McLachlan matrix was chosen because it scores residue substitutions on the basis of chemical similarity (that is, physicochemical properties). The second vector contained the pairwise similarity of all PWMs to each other, based on the Frobenius norm (69).
Each position was then scored with the Pearson correlation coefficient of these two vectors (listed under “correlation” in Table 2). This method was implemented in the programming package MATLAB. Distances of the residue in question from bound peptide were estimated by mapping the residue onto the PKA-PKI structure (PDB ID: 1ATP) using the program VMD. Each peptide-kinase distance was measured as the shortest distance between the geometric center of the kinase residue, as mapped onto the PKA structure, and the bound peptide in that structure.
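The correlation-based scoring of an alignment position can be sketched as follows. Several details are assumptions made for illustration: PWM similarity is taken as the negative Frobenius norm of the difference between two PWMs, an identity function stands in for the McLachlan matrix (whose values are not reproduced here), and the PWMs are passed as plain nested lists. The original implementation was in MATLAB.

```python
import math
from itertools import combinations


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two equal-shaped matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def position_score(column, pwms, similarity):
    """Correlate residue similarity at one alignment column with PWM similarity.

    column     : residues at this alignment position, one per kinase
    pwms       : one matrix per kinase (or the slice for a single motif position)
    similarity : callable (aa1, aa2) -> float, e.g. a substitution-matrix lookup
    Both vectors have k(k - 1)/2 entries, one per unordered kinase pair.
    """
    pairs = list(combinations(range(len(column)), 2))
    seq_sim = [similarity(column[i], column[j]) for i, j in pairs]
    pwm_sim = [-frobenius_distance(pwms[i], pwms[j]) for i, j in pairs]
    return pearson(seq_sim, pwm_sim)


def identity_similarity(a, b):
    """Placeholder for a real substitution matrix: 1 if identical, else 0."""
    return 1.0 if a == b else 0.0


# Four toy kinases: two share Arg and one motif preference, two share Lys and another.
toy_pwms = [[[2.0, 0.0]], [[2.0, 0.0]], [[0.0, 2.0]], [[0.0, 2.0]]]
print(round(position_score("RRKK", toy_pwms, identity_similarity), 3))
```

A column whose residue identity tracks the motif differences scores near one, whereas an invariant column yields a constant similarity vector and is scored as zero in this sketch.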
Materials and Methods
Fig. S1. Assay reproducibility.
Fig. S2. Representative peptide array screening results for the Kss1 S147E mutant.
Fig. S3. Representative Prk1 in vitro assays.
Table S1. Representative peptide array screening results and sequence logos for each of the 61 kinases assayed.
Table S2. Protein kinases analyzed in this study.
Table S3. Pairwise correlation coefficients for each of four genomic features and the Scansite match score.
Table S4. Alignment of yeast kinases analyzed in this study.
Dataset S1. Average PWMs for each of the 61 kinases assayed (plain text files).
Dataset S2. MOTIPS output for each of the 61 kinases assayed (plain text files).
Dataset S3. MUSCLE alignment of all predicted S. cerevisiae ORFs with orthologs from 12 other yeast species (clustal alignment files).
Dataset S4. Alignment of yeast kinases analyzed in this study (clustal alignment file).
Supported by US National Institutes of Health grants to M.S., B.E.T. (GM079498), N.M.H. (GM50717), D.F.S. (CA82257) and by a Swiss National Science Foundation grant to C.D.
Editor’s Summary: Exploring Kinase Selectivity
Kinases are master regulators of cellular behavior. Because of the large number of kinases and even larger number of substrates, approaches that permit global analysis are valuable tools for investigating kinase biology. With a miniaturized peptide library screening approach, Mok et al. identified the phosphorylation site selectivity for 61 of the 122 kinases in Saccharomyces cerevisiae. By integrating these data with other datasets and structural information, they revealed information about the relationship between kinase catalytic residues and substrate selectivity. They also identified and experimentally verified substrates for kinases, including one for which only limited functional information was previously available, demonstrating the potential of this type of analysis as a launching point for exploring the biological functions of kinases.
Author contributions: J.M., S.P., G.R.J., D.L.S., S.A.P., V.D. and B.E.T. performed experiments; P.M.K., H.Y.K.L. and M.B.G. performed computational work; J.M., X.Z., G.R.J., S.A.P., V.D., M.J., E.C., H.N., M.G., A.R., J.N.M., Y.S., H.E.S., R.S., C.S.M.C., C.D., N.M.H., W.A.L., D.F.S., B.S., and B.J.A. prepared and characterized protein kinases and expression constructs; J.M., P.M.K., H.Y.K.L., M.B.G., M.S. and B.E.T. designed experiments, analyzed data and wrote the paper.
Competing interests: M.S. consults for Affomix, which has an interest in proteomics, including phosphoproteomics.