Protein phosphorylation plays a central role in cellular regulation. Recent proteomics strategies for identifying phosphopeptides have been developed using the model organism Saccharomyces cerevisiae, and consequently, when combined with studies of individual gene products, the number of reported specific phosphorylation sites for this organism has expanded enormously. In order to systematically document and integrate these various data types, we have developed a database of experimentally verified in vivo phosphorylation sites curated from the S. cerevisiae primary literature. PhosphoGRID (www.phosphogrid.org) records the positions of over 5000 specific phosphorylated residues on 1495 gene products. Nearly 900 phosphorylated residues are reported from detailed studies of individual proteins; these in vivo phosphorylation sites are documented by a hierarchy of experimental evidence codes. Where available for specific sites, we have also noted the relevant protein kinases and/or phosphatases, the specific condition(s) under which phosphorylation occurs, and the effect(s) that phosphorylation has on protein function. The unique features of PhosphoGRID that assign both function and specific physiological conditions to each phosphorylated residue will provide a valuable benchmark for proteome-level studies and will facilitate bioinformatic analysis of cellular signal transduction networks.
Database URL: http://phosphogrid.org/
Cellular responses to physiological signals, including cell growth, differentiation and death, are mediated by post-translational protein modifications, most notably phosphorylation, which function to transmit signals to downstream effectors and target molecules (1,2). At least one half of all proteins in a typical eukaryotic cell are phosphorylated (3); site-specific phosphorylation on serine, threonine and tyrosine residues is thus the most abundant and well-characterized intracellular post-translational modification. The addition or removal of phosphate groups by protein kinases and phosphatases, respectively, can regulate protein interactions, activity and conformation (4). The budding yeast genome encodes 130 protein kinases and some 40 protein phosphatases (5,6), while the human genome encodes more than 500 protein kinases and over 100 protein phosphatases (7–9). The vast combinatorial possibilities afforded by the global kinase–phosphatase network present an enormous challenge in deconvolving the information flow that underlies cellular behavior (10).
The development of high throughput strategies for detection and sequence determination of phosphopeptides offers the potential to exhaustively catalogue the phosphorylation status of the proteome under different conditions (11). However, the full biological significance of this information will only be realized through the identification of the enzymes that regulate each specific phosphorylation site, the conditions under which the phosphorylation occurs, and the functional consequences of the modification for protein function (12). Delineation of complete signaling networks and regulatory pathways will require a combination of approaches to assign these parameters, in combination with bioinformatics and modeling tools to organize and analyze the information.
Because of the powerful array of genetic, molecular biological, genomic and proteomic strategies developed for S. cerevisiae, this organism has become a model of choice for global characterization of cellular regulatory networks and for implementation of novel functional genomic methods. The scope of genomics and proteomics resources available for S. cerevisiae includes: protein interaction networks derived from two-hybrid and mass spectrometry data (13,14), genetic synthetic lethal interactions (15,16), subcellular compartmentalization (17), global gene expression patterns under a variety of conditions (18,19), global identification of protein-DNA interactions (20), and comparative fungal genomics (21,22). Combined with rapid progress in identification of phosphorylated residues, these resources should eventually enable comprehensive prediction of phospho-regulatory networks (12).
In order to facilitate the analysis and prediction of protein kinase/phosphatase-substrate relationships and signaling networks, we have developed a database of experimentally verified in vivo protein phosphorylation sites for S. cerevisiae. The initial version of the database, designated PhosphoGRID, documents approximately 5000 individual phosphorylated residues on 1495 gene products annotated from the published literature. For each phosphorylated residue, where data is available, we record relevant protein kinases and phosphatases, specific conditions under which the modification occurs, and the effect on protein function. All entries in PhosphoGRID are linked to other existing online yeast resources, including the BioGRID interaction database (13) and the Saccharomyces Genome Database (SGD) (23). PhosphoGRID will also provide an important resource to benchmark mass spectrometry-based methods for the global assignment of phosphorylation sites (24–26).
Several online protein phosphorylation resources have been described previously, but most of these do not contain a significant focus on S. cerevisiae. NetPhos and Scansite are online search tools that enable prediction of phosphorylation sites based on consensus sequences defined in vitro (27,28). These web-based tools are useful in predicting candidate sites in cases where a kinase–protein substrate relationship has been established in vivo, but they suffer from over-prediction and therefore have limited usefulness for identifying phosphorylation sites with physiological relevance. Furthermore, because these prediction tools are largely devoted to metazoans, they are reported to be less reliable for prediction of potential sites in S. cerevisiae, which has at least 32 unique protein kinases (29). Consequently, a phosphorylation site prediction tool specific for Saccharomyces, NetPhosYeast (30), has recently been described. PhosphoSite is a curated web-based resource for physiologically relevant phosphorylations in mammals (31). A similar database, Phospho.ELM (formerly known as PhosphoBase), contains a collection of defined eukaryotic phosphorylation sites, but is not focused on any one species (32); fewer than 150 entries in Phospho.ELM represent sites from ‘other species’, including yeast. A number of phosphorylation site databases are focused on individual or a limited number of species, including databases for archaea and prokaryotic organisms (33), Arabidopsis (34), and more recently PhosphoPep, which contains data from proteomics initiatives for model organisms including S. cerevisiae, Drosophila and C. elegans (35). Similarly, PHOSIDA contains data produced from mass spectrometry of phosphoproteomes from a variety of eukaryotic and prokaryotic species, but currently has no data from yeast (36,37). PhosphoGRID is thus the first online resource that currently focuses exclusively on experimentally defined phosphorylation sites in the budding yeast S. cerevisiae.
PhosphoGRID documents sites from both mass spectrometry-based proteomics efforts and from focused studies on individual gene products; moreover, PhosphoGRID is the first resource to link each specific phosphorylation event with relevant physiological conditions, protein kinases and protein phosphatases.
Consistent annotation is essential in order to establish a non-redundant collection of phosphorylation sites on proteins and to ensure accuracy for search queries and curation efforts. PhosphoGRID utilizes annotation compiled from the Saccharomyces Genome Database including protein names, descriptions, aliases, sequences, Gene Ontology (GO) mappings, and external database identifiers (23). All ancillary information is compiled via an in-house annotation compilation system (ACS) written in Java SDK version 1.5 (java.sun.com). PhosphoGRID annotation tables are updated on a bi-monthly basis and seamlessly integrate with existing curation to ensure that searches always reflect current annotation.
Data contained within version 1.0 of PhosphoGRID is curated from all papers published prior to the end of 2008. We examined abstracts from approximately 1400 published manuscripts from PubMed with keywords relating to phosphorylation in S. cerevisiae (yeast, phosphorylation, residue, phosphorylation site, protein kinase), and/or that had been flagged with relevance to protein phosphorylation within the yeast BioGRID database (13). Abstracts from 514 of these papers indicated possible reference to specific phosphorylated residues, and these were examined in detail. Of this subset, 332 contained descriptions of specific phosphorylated residues. The vast majority of defined phosphorylation sites were derived from four large-scale mass spectrometry-based analyses of phosphopeptides (24,26,38,39). For each residue identified as a specific phosphorylation, we noted the evidence(s) for that phosphorylation, as well as whether a protein kinase or phosphatase, function, or specific condition was associated with the residue, and whether the phosphorylation had a defined effect on the protein activity. For each phosphorylation site listed in the dataset, we also verified that the residue number cited in the literature corresponds with the sequence in GenBank. We observed a substantial number of inaccurately stated residue positions that primarily arise because of a discrepancy in the actual translational start site, or because the open reading frames, as documented in SGD, generally do not reflect post-translational cleavage of the gene product. In such cases, the position of the phosphorylation site in PhosphoGRID was mapped to the corresponding residue in the ORF as documented in GenBank and a free text comment in a ‘Notes’ field was used to document the discrepancy. The main annotation categories in PhosphoGRID were assigned as follows:
Phosphorylation information on any gene product of interest can be accessed through the search interface (Figure 1, top right). The search retrieval page display provides the protein amino acid sequence with all documented, experimentally verified phosphosites highlighted as red text (Figure 1). Upon mouse-over of each phosphosite, a pop-up window provides a summary of the phosphorylation site evidence, as well as the specific condition under which it occurs and functional consequence, where known. Consensus sequences for a limited number of protein kinases with well-defined specificity, which overlap verified phosphosites, are indicated in blue text on the amino acid sequence. This feature will be expanded in future updates as consensus sites for more yeast protein kinases are elaborated (30). Tables below the protein sequence provide details on each experimentally identified phosphosite, including experimental evidence, functional consequences (Figure 1, lower), and identity of the cognate protein kinases and/or phosphatases (Figure 2), and where relevant, specific regulatory subunits. For protein kinases and phosphatases themselves, and their corresponding regulatory subunits, an additional table displays sites of phosphorylation/dephosphorylation for known substrate proteins, and includes a summary of the evidence(s) for involvement in these reactions, as well as links to the corresponding substrate pages. An example of this feature is shown for the mating pheromone MAP kinase Fus3 (Figure 2). Each record also provides links to additional resources for each gene product provided at SGD and the NCBI protein database. Finally, for each site of phosphorylation and associated evidence codes, hyperlinks are provided to the original articles listed in PubMed from which the data was curated.
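The overlap between kinase consensus sequences and verified phosphosites described above can be illustrated with a minimal Python sketch. It is not the PhosphoGRID implementation: the pattern S/T-P-x-K/R is assumed here as a simplified consensus for a proline-directed cyclin-dependent kinase such as Cdc28, and the function name, toy sequence and verified positions are hypothetical.

```python
import re

# Assumed, simplified consensus for a proline-directed cyclin-dependent
# kinase: S/T-P-x-K/R. A lookahead is used so overlapping matches are found.
CDK_CONSENSUS = re.compile(r"(?=([ST]P.[KR]))")


def consensus_hits(sequence, verified_sites):
    """Return (position, motif, is_verified) for each consensus match.

    position is the 1-based residue number of the phosphoacceptor (S or T).
    verified_sites is a set of 1-based positions with experimental evidence.
    """
    hits = []
    for match in CDK_CONSENSUS.finditer(sequence):
        position = match.start() + 1  # 0-based index -> residue number
        hits.append((position, match.group(1), position in verified_sites))
    return hits


# Hypothetical toy data, for illustration only.
toy_sequence = "MASPEKLLTPGRAASPAA"
toy_verified = {3, 15}

for position, motif, is_verified in consensus_hits(toy_sequence, toy_verified):
    label = "verified site" if is_verified else "predicted only"
    print(f"{motif} at residue {position}: {label}")
```

In this toy case, only consensus matches whose phosphoacceptor is also in the verified set would be candidates for highlighting; a verified site outside any consensus match (residue 15 here) is simply not reported by the scan.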
All of the data within PhosphoGRID is freely downloadable in text file format through the ‘Downloads’ tab (Figure 1, top). Download data is refreshed regularly to correspond with new phosphorylation site entries as well as annotation updates via the ACS. In future updates, we will include support for additional download formats including PSI-MI2.5 (54), Osprey (55) and Cytoscape (56). In order to help maintain a current dataset, we have also implemented an online submission form, accessible through the ‘Contribute’ tab (Figure 1, top), through which users can contribute unpublished or newly published information. Contributions will be accepted for residues where evidence of in vivo phosphorylation is documented by one or more experimental evidence(s) as indicated in the ‘Experimental Evidence for Phosphorylation’ field. All PhosphoGRID corrections and clarifications can also be sent to admin@phosphogrid.org.
Data in version 1.0 of PhosphoGRID was curated from S. cerevisiae publications up to 31 December 2008. The vast majority of phosphosites, greater than 4200, were generated from four seminal high throughput (HTP) proteomics studies based on mass spectrometric analysis of phosphopeptides derived from total cell protein (24,26,38,39). A total of 851 phosphorylated residues were identified by analysis of individual proteins and/or purified protein complexes in dedicated low throughput (LTP) studies. Surprisingly, the overlap between the HTP and LTP datasets is relatively modest, as only 149 of the 851 sites in LTP data are found in HTP studies (Figure 3A). This limited concordance illustrates the difficulty in systematically mapping phosphorylation sites and suggests that existing phosphoproteome datasets are probably highly incomplete. Based on the overlap of phosphorylation sites identified in three large-scale studies (24,38,39), and overlap between sites identified in HTP versus LTP studies, we predict that the yeast proteome may contain on the order of 15 000 phosphorylated residues. Approximately 80% of the phosphorylated residues documented in PhosphoGRID occur on serines, with threonine and tyrosine representing 19 and 1.3% of phosphorylated residues, respectively (Figure 3B); these proportions are roughly similar for sites identified in HTP and LTP studies (not shown). Yeast do not have phosphotyrosine-specific protein kinases akin to those in metazoan cells (30), and so it is interesting that the relative proportion of phosphorylated tyrosine residues in vivo is similar to that observed in higher eukaryotes (57). This observation supports the view that some protein kinases have more relaxed hydroxyl amino acid specificity than is generally appreciated; indeed phosphorylation on tyrosine residues is frequently observed in vitro with various serine/threonine protein kinases (58).
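The comparison of focused and high throughput datasets above amounts to a set intersection over site identifiers. The following Python sketch shows one way such an overlap could be computed; the representation of a site as an (ORF, residue, position) tuple, the function name and the toy entries are assumptions for illustration and do not reflect the PhosphoGRID schema. The final lines apply the same arithmetic to the counts reported in the text.

```python
def overlap_summary(focused_sites, proteomic_sites):
    """Summarize how many focused-study sites also appear in proteomic data.

    Each site is a hashable key; here a hypothetical
    (orf_name, residue, position) tuple.
    """
    shared = focused_sites & proteomic_sites
    total = len(focused_sites)
    fraction = len(shared) / total if total else 0.0
    return {"focused_total": total, "shared": len(shared), "fraction": fraction}


# Hypothetical toy entries, for illustration only.
focused = {("YFG1", "S", 12), ("YFG1", "T", 40), ("YFG2", "S", 7), ("YFG3", "Y", 101)}
proteomic = {("YFG1", "S", 12), ("YFG4", "S", 55)}

print(overlap_summary(focused, proteomic))

# Same calculation using the counts reported in the text:
# 149 of 851 sites from focused studies were also found in proteomic studies.
reported_fraction = 149 / 851
print(f"Reported overlap: {reported_fraction:.1%}")
```

With the reported counts, roughly 17.5% of sites from focused studies are recovered in the proteomic datasets. The sketch does not attempt to reproduce the estimate of total phosphorylated residues, since the text does not specify how that figure was derived.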
To date, 1495 of the 5584 proteins encoded by the yeast genome appear to contain one or more phosphorylated residues (Figure 4). Given that the phosphoproteome is incompletely charted, it seems probable that most, if not all, yeast proteins will be phosphorylated under one or more conditions. Greater than one-third of phosphoproteins recorded in PhosphoGRID have a single identified phosphorylation site, while the remainder are multiply phosphorylated on anywhere between 2 and greater than 40 sites (Figure 4B). Proteins with large numbers of phosphorylated residues include Rpo21, Swe1, Cdh1, Net1 and Rad53, each having greater than 30 separate entries. Rpo21, also known as Rpb1, is the largest subunit of RNA Polymerase II that contains a C-terminal domain (CTD) consisting of 26 direct repeats of the heptapeptide YSPTSPS. Phosphorylation and dephosphorylation of serines 2, 5 and 7 on the heptapeptide repeat govern the transcription cycle through regulated assembly of various subcomplexes that modify polymerase function (59–62); combinations of CTD phosphorylation events might produce a CTD ‘code’ for transcription (63,64). We note, however, that evidence for phosphorylation of the CTD is limited to recognition by antibodies specific for Ser2, Ser5 and Ser7 phosphorylated heptapeptides, and there has been no direct demonstration of phosphorylation on individual repeats within the CTD, nor has the extent to which the CTD can be multiply phosphorylated in vivo been established.
In considering proteins with numerous reported phosphorylation sites, it is apparent that there are biases in the identification of residues in HTP versus LTP approaches. For example, Net1, one of the most heavily phosphorylated proteins studied to date (Figure 4), has a total of 34 identified phosphosites; 9 of these are derived from proteomics efforts, and 25 from two studies that examined the role of phosphorylation in Net1 function (65,66); curiously, though, there are no sites in common between these studies (Table 5). Similarly, Pan1 has a total of 24 phosphosites, 8 from HTP studies and 16 from focused LTP studies (67,68), none of which are in common. Numerous additional similar anomalies exist (Table 5, and data not shown). For these heavily phosphorylated proteins, the differences may in part reflect the fact that most of the phosphorylations identified in focused studies occur under specific physiological conditions. For example, phosphorylation of Swe1, Net1 and Sic1 is primarily limited to specific phases of the cell cycle (42,66,69), and consequently these sites are likely to be underrepresented in samples from unsynchronized cells typically used for analysis in HTP studies. Similarly, most of the phosphorylation events characterized on Rad53 and Rad9 occur in response to DNA damage (70,71). Considering that 424 phosphorylation sites identified in LTP studies, nearly half of the total, are associated with a specific physiological condition (Table 3), the modest overlap with HTP may reflect the significant effects of environmental conditions. Apart from the major studies examining differential phosphorylation in pheromone-treated cells (24,26) (Table 3), there have not been other large-scale proteomics efforts examining phosphorylation under additional physiological conditions.
As noted, an important feature of PhosphoGRID is that we have documented effects that each phosphorylation has on the target protein activity, where available. Currently, 490 phosphorylation sites are known to affect protein activity (Table 2). Encouragingly, approximately three-fourths of phosphorylation events identified in LTP studies are associated with phenotypic consequences, as revealed by mutational analysis (Figure 5, right); however, this strong correlation may result from study bias, in that only phosphorylation sites linked to a biological response are reported in the literature. Many phosphorylation events have a cumulative influence on protein activity such that phenotypes may be revealed only by combinatorial mutation of multiple phosphoacceptor sites. A well-characterized example is the finding that six Cdc28-dependent phosphorylation events on Sic1 are required for its recognition by Cdc4; this multisite dependence confers ultrasensitive or switch-like behavior on the degradation of Sic1 (42). The preponderance of multiply phosphorylated proteins in PhosphoGRID suggests that many phosphorylation-dependent responses may be imbued with similar qualities (72).
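Why a requirement for multiple phosphorylation events can produce a switch-like response may be seen in a deliberately simplified toy calculation. The Python sketch below assumes that each site is phosphorylated independently with the same probability p, so that the fraction of molecules carrying all n required phosphates is p to the power n. This independence assumption, and the function names, are ours for illustration; the sketch is not the quantitative model used in the cited Sic1 study.

```python
def fraction_fully_phosphorylated(p, n_sites):
    """Fraction of molecules with all n_sites phosphorylated.

    Toy assumption: sites are modified independently, each with
    probability p (0 <= p <= 1).
    """
    return p ** n_sites


def fold_response(p_low, p_high, n_sites):
    """Fold change in fully phosphorylated fraction when p rises."""
    return (fraction_fully_phosphorylated(p_high, n_sites)
            / fraction_fully_phosphorylated(p_low, n_sites))


# Compare a single-site requirement with a six-site requirement as the
# per-site phosphorylation probability rises from 0.5 to 0.9.
for n_sites in (1, 6):
    change = fold_response(0.5, 0.9, n_sites)
    print(f"{n_sites} site(s) required: {change:.1f}-fold increase")
```

Under these assumptions, the same rise in per-site phosphorylation that gives a less than two-fold change for a single-site requirement gives a change of more than thirty-fold when six sites are required, which is the qualitative sense in which multisite dependence sharpens the response.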
PhosphoGRID is a repository for protein phosphorylation information in S. cerevisiae, particularly for data derived from LTP studies reported in the primary literature. As illustrated here, the LTP dataset provides a benchmark for HTP proteomic studies and will be an important resource for the construction of mathematical models of signaling networks. The initial release of PhosphoGRID contains all data published prior to 2009; we will build on this comprehensive dataset with regular curation updates, in conjunction with elaboration of the repertoire of search and display functions within the resource. Future PhosphoGRID releases will also have expanded capabilities, including documentation of in vitro phosphorylation of substrates by specific protein kinases, where specific residues have not been identified, demonstrated in both high throughput (58) and focused studies. In combination with expanded protein kinase consensus site prediction capability, this information will be important for bioinformatic analysis of signaling networks.
In order to provide an up-to-date and complete resource, we encourage community contributions of new data through the online data submission feature; in this latter regard, we also believe it will be important to report instances where phosphorylation site mutations do not yield an obvious phenotype, particularly as such data is rarely published. A long-term challenge for phosphoproteomics will be to fill in the enormous void in our understanding of the functional consequences of the myriad of phosphorylation events in the cell; PhosphoGRID should help meet this challenge.
A Canada Research Chair in Functional Genomics and Bioinformatics, a Royal Society Wolfson Research Merit Award and the Scottish Universities Life Sciences Alliance through the Scottish Funding Council (to M.T.); Canadian Cancer Society Research Institute grant 0011258 (to I.S.), NIH National Center for Research Resources grant 1R01RR024031-01 (to M.T.) and Biotechnology and Biological Sciences Research Council grant BB/F010486/1 (to M.T.). Funding for open access charge: Canadian Cancer Society Research Institute.
Conflict of interest. None declared.
The authors thank LeAnn Howe and Francis Ouellette for helpful discussions.