The Phospho.ELM resource (http://phospho.elm.eu.org) is a relational database designed to store in vivo and in vitro phosphorylation data extracted from the scientific literature and phosphoproteomic analyses. The resource has been actively developed for more than 7 years and currently comprises 42574 serine, threonine and tyrosine non-redundant phosphorylation sites. Several new features have been implemented, such as structural disorder/order and accessibility information and a conservation score. Additionally, the conservation of the phosphosites can now be visualized directly on the multiple sequence alignment used for the score calculation. Finally, special emphasis has been put on linking to external resources such as interaction networks and other databases.
Over the past few years, many advances have been made in mass spectrometry techniques and protein enrichment strategies that have significantly improved the detection efficiency of phosphorylated proteins (1,2). Consequently, steadily increasing numbers of phosphorylated peptides are being reported from mouse and human cell lines as well as tissue samples (3,4).
However, the knowledge of the phosphorylated sites per se is neither sufficient to identify how signals are propagated into cells nor adequate to define the complexity of the intracellular networks. To fully appreciate the relevance of phosphoproteomic approaches it is essential to gain additional knowledge about the biological conditions under which the phosphorylation occurs, to identify the enzymes (kinases and phosphatases) that switch ‘on and off’ their substrates, and to understand the functional consequences that these modification events have on cellular processes.
Amino acid phosphorylation is probably the most abundant of the intracellular post-translational protein modifications used to regulate the state of eukaryotic cells, with estimates ranging up to 500000 phosphorylation sites in the human proteome (5). Is this vast number plausible? It is considered that cell regulatory systems exhibit the property of robustness, but that this vital property cannot be achieved without system complexity (6). Complexity is therefore unavoidable, yet it has probably been systematically underestimated so far. However, there are now indications that we are at the dawn of a new and more realistic era in our approaches to signaling research (7). More and more authors are highlighting the importance of factors such as cooperativity, networking, redundancy and decision-making by in-complex molecular switching as we move away from overly linear pathway-based descriptions of cellular systems (8–14). In this context, the efforts to deploy large scale phosphoproteomics to map cellular networks (e.g. (15–17)) can be seen as indispensable to the process of covering more of the signaling network space.
An important aspect to consider is the evolutionary conservation of the phospho-residues. Due to their crucial role in regulating protein function, one could expect phosphorylation events to be conserved among species. However, phosphorylation motifs are short, strongly dependent on the surrounding context (18) and often reside in unstructured and rapidly-evolving regions (19), hence they have been difficult to trace evolutionarily and mixed conclusions have been reported (20,21). Lack of data has limited the possibility for this kind of analysis; only recently have phosphoproteomic data become available from different model organisms, thereby enabling comparative studies of the evolutionary and functional dynamics of reversible phosphorylation across eukaryotes (20). A notable, though not surprising, observation is that phosphorylated residues are significantly more conserved than equivalent but non-phosphorylated ones (10,22,23).
Since the future knowledge and exploitation of reversible phosphorylation relies on the accessibility of the data, it is of fundamental importance to develop and maintain public repositories to facilitate data retrieval for both wet lab scientists and computational biologists. In this article we describe the content and the more recent features of Phospho.ELM, a manually curated web-based resource dedicated to eukaryotic phosphorylation sites.
The core structure of the database has been retained (24,25) and extended, while new features have been implemented to improve data retrieval and presentation. In addition to a much larger data set, the update also includes new information for each phosphorylated residue, i.e. a conservation score (CS) and a surface accessibility score (either calculated or predicted).
The Phospho.ELM data can be accessed through a user-friendly web interface, directly via URL, or programmatically via an XML/SOAP Web Service. The user can query the database by keyword or sequence identifier [from UniProt (26) or Ensembl (27)] to get information about single proteins/substrates, or by kinase name to retrieve all phosphorylated substrates of a particular kinase. It is also possible to restrict the query to different taxonomy groups. Table 1 lists all available options for querying the database.
The results page displays all phosphoproteins and phosphorylation sites (instances) meeting the search criteria; the results tables can be sorted on any column, aiding inspection of the data according to different criteria, such as: the residue number and code, the PubMed references, the sequence (±10) surrounding the phospho-residue and, when available, the upstream kinases as well as the phosphopeptide-binding domains, such as the SH2, 14-3-3 or PTB domains (these data have been annotated for 1250 phosphosites). Furthermore, links to matching kinase recognition motif entries stored in the ELM database (28) are provided when available. An example Phospho.ELM output is shown in Figure 1.
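The ±10 sequence window around a phospho-residue can be illustrated with a minimal Python sketch. The function name `site_window`, the 1-based position convention and the toy sequence are assumptions made for illustration; they are not taken from the Phospho.ELM implementation.

```python
def site_window(sequence: str, position: int, flank: int = 10) -> str:
    """Return the residues within `flank` positions of a 1-based site position.

    The window is truncated at the sequence termini rather than padded.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    index = position - 1                      # convert to 0-based indexing
    start = max(0, index - flank)             # clip at the N-terminus
    end = min(len(sequence), index + flank + 1)  # clip at the C-terminus
    return sequence[start:end]


# Toy 30-residue sequence (hypothetical, not a real protein).
toy_sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
window = site_window(toy_sequence, 16)  # serine at position 16
```

For a site far enough from both termini the window holds 21 residues with the phospho-residue at the center; near a terminus it is simply shorter.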
The kinases are currently known for only ~12% of the curated instances. It should be mentioned that this percentage has decreased since the last Phospho.ELM publication (25); this reflects the current limitations of both experimental and computational methods in assigning the kinase that recognizes a given phosphorylation site. Since this information is relevant for gaining insights into the regulation of cellular processes, we provide direct links to NetworKIN (29), a database of predicted kinase–substrate relations.
In addition, we encourage our users to explore the links to other resources that integrate information on signaling networks or protein–protein interactions, such as MINT (30) and STRING (31), to expand their knowledge of the phosphoprotein substrates.
The entire Phospho.ELM data set can be freely downloaded in a tab-delimited format at: http://phospho.elm.eu.org/dataset.html
The current release of the Phospho.ELM data set (version 9.0) contains more than 42500 non-redundant instances of phosphorylated residues in more than 11000 different protein sequences (3370 tyrosine, 31754 serine and 7449 threonine residues).
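As an illustration of how a tab-delimited dump of this kind could be summarized, the following Python sketch tallies instances by residue code and computes the fraction with an annotated kinase. The column names (`acc`, `position`, `code`, `kinases`) and the sample rows are hypothetical; the header of the actual download should be checked before adapting the code.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in tab-delimited form; real column names may differ.
SAMPLE = (
    "acc\tposition\tcode\tkinases\n"
    "P00001\t15\tS\t\n"
    "P00001\t42\tT\tPKA\n"
    "P00002\t7\tY\tSRC\n"
    "P00002\t99\tS\t\n"
)


def load_rows(handle):
    """Read a tab-delimited table into a list of dictionaries keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by_residue(rows):
    """Count instances per residue code (S, T or Y)."""
    return Counter(row["code"] for row in rows)


def fraction_with_kinase(rows):
    """Fraction of instances whose kinase field is not empty."""
    if not rows:
        return 0.0
    annotated = sum(1 for row in rows if row["kinases"].strip())
    return annotated / len(rows)


rows = load_rows(io.StringIO(SAMPLE))
residue_counts = count_by_residue(rows)
kinase_fraction = fraction_with_kinase(rows)
```

To work on a downloaded file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.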
For each phosphosite we report whether the phosphorylation evidence has been identified by small-scale analyses (low-throughput, LTP) and/or by large-scale experiments (high-throughput, HTP), which mainly apply MS techniques. The Venn diagram in Figure 2 shows the remarkably small overlap between the LTP and HTP phospho-instances.
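The LTP/HTP comparison summarized by such a Venn diagram amounts to set operations on site identifiers. The sketch below assumes that a site is keyed by a (protein accession, residue position) pair; this key and the example values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]  # (protein accession, residue position); hypothetical key


def evidence_overlap(ltp_sites: Iterable[Site],
                     htp_sites: Iterable[Site]) -> Dict[str, int]:
    """Count sites seen only in LTP, only in HTP, or in both evidence classes."""
    ltp, htp = set(ltp_sites), set(htp_sites)
    return {
        "ltp_only": len(ltp - htp),
        "htp_only": len(htp - ltp),
        "both": len(ltp & htp),
    }


# Made-up example data.
ltp_example = [("P00001", 15), ("P00001", 42), ("P00002", 7)]
htp_example = [("P00001", 42), ("P00003", 88)]
summary = evidence_overlap(ltp_example, htp_example)
```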
The majority of the protein instances from Phospho.ELM are vertebrate (mostly Homo sapiens (62%) and Mus musculus (16%)) though 22% are from other species, mainly Drosophila melanogaster (13%) and Caenorhabditis elegans (7%).
In total, more than 300 different kinases have been annotated and a document providing additional information about all kinases annotated in Phospho.ELM can be found at http://phospho.elm.eu.org/kinases.html.
In order to improve the biological understanding of a particular site, and thereby indirectly provide users with additional evidence, we have added information about sequence conservation. This will help researchers to better assess the reliability of the identified sites, especially those derived from proteomic analyses. For each instance, we have calculated the conservation score (CS) as described in Chica et al. (32).
The conservation of the phosphorylation sites in the database has been calculated using a tree-based approach specifically developed for assessing the conservation of short linear protein motifs (32), also accessible as individual service at http://conscore.embl.de. The method takes into account the presence/absence of the phosphorylated serine, threonine or tyrosine in relation to the global sequence conservation and gives a value between 0 and 1, where 1 indicates conservation in all the homologous sequences at a certain distance from the query sequence, and 0 corresponds to absence of conservation.
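To make the 0-to-1 scale concrete, the following Python sketch computes a deliberately simplified score: the unweighted fraction of aligned homologous sequences that carry the same residue as the query at the phosphosite column. This is not the tree-based method of Chica et al. (32), which additionally relates the site to the global sequence conservation and to the distance from the query sequence; the function name and the toy alignment columns are assumptions for illustration only.

```python
def naive_conservation(column: str, query_residue: str) -> float:
    """Unweighted fraction of homolog residues identical to the query residue.

    `column` holds one character per homologous sequence at the aligned
    phosphosite position; gaps ('-') count as absence of the residue.
    """
    if not column:
        raise ValueError("alignment column is empty")
    matches = sum(1 for residue in column if residue == query_residue)
    return matches / len(column)


# Toy alignment columns (hypothetical).
fully_conserved = naive_conservation("SSSS", "S")
not_conserved = naive_conservation("AG-N", "S")
half_conserved = naive_conservation("SSTA", "S")
```

A tree-based score would weight each homolog by its evolutionary distance from the query, so two alignments with the same raw fraction can receive different values.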
The CS of an instance is an estimation of the persistence of a phosphosite during the divergent evolution of a homologous protein sequence set. In a protein-centered view a high CS, together with other contextual information such as high residue accessibility, represents cumulative evidence for the biological relevance of such a site. This is particularly useful when analyzing HTP sites that might be phosphorylated in vitro but not in vivo. The distribution of the CS of manually annotated instances is indeed significantly (P-value <2.2e-16) skewed towards 1, in comparison to that of the instances coming from HTP experiments (Figure 3).
In a protein interaction network context, the CS can be used to suggest evolutionarily stable protein interactions as well as taxa-specific interactions that might have been gained during evolution as regulatory circuits are changed and modulated.
Alignments between the phosphoprotein and the corresponding homologous sequences are available for close inspection of the conservation of the phosphosites of interest in different species (Figure 1, button ‘view conservation’). To this end, the alignment editor Jalview (33) (http://www.jalview.org) has been embedded as a Java plugin in the HTML output. Here, known instances are highlighted in different colors according to the phospho-residues (light green for phosphotyrosine, purple for phosphoserine and red for phosphothreonine), while the conservation of the corresponding peptides in the aligned sequences is displayed as dark green columns.
We urge users to look at these alignments, particularly if the CS is low, since several factors unrelated to the evolution of the protein sequence, e.g. sequencing errors in poorly studied genomes, could artificially lower the score even if the site itself is well conserved across different species.
Phosphorylation sites are often found in intrinsically disordered regions of proteins (34), which usually cannot be experimentally determined by X-ray crystallography. However, in a number of cases, they lie on globular domains whose sequence can confidently be mapped onto X-ray determined structures.
For the latter sites, accessibility to the solvent can be calculated. Currently we have been able to assign an accessibility value to 3% of all the sites in Phospho.ELM (1281 of 42574 instances) and we anticipate that this number will increase in parallel with the increase in solved structures. These data are particularly relevant for bioinformaticians who develop computational methods to predict kinase substrates. Because of the transient nature of phosphorylation events, phosphorylation sites tend to lie on the surface of proteins. Many studies [see Via et al. (35) for a summary] have shown that the substrate specificity is not only dependent on the primary sequence of the motif hosting the phosphorylation site, but also on its structural conformation.
The correspondence between a Phospho.ELM sequence and an X-ray Protein Data Bank (PDB) structure (36) is based on sequence alignment using at least 98% global sequence identity and 100% identity at the phosphorylation site. When more than one PDB structure corresponded to a single Phospho.ELM sequence, one with the lowest resolution was retained. Whenever a site can be mapped onto a PDB structure, its solvent accessibility (SA in Å²) is taken from DSSP (37) and the corresponding percentage is obtained by normalizing the SA to the phospho-residue accessibility maximum value [as determined in (38)].
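The selection and normalization steps described above can be sketched in Python as follows. The `StructureHit` record, the function names and the PDB identifiers are hypothetical, and the sketch assumes that ‘lowest resolution’ refers to the smallest resolution value in Å. The maximum accessibility per residue type comes from reference (38) and is therefore passed in as a parameter rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StructureHit:
    """One candidate PDB match for a sequence (hypothetical record)."""
    pdb_id: str
    global_identity: float  # percent identity over the global alignment
    site_identical: bool    # True if the phosphosite residue is identical
    resolution: float       # in Å; smaller values mean finer detail


def pick_structure(hits: Sequence[StructureHit],
                   min_identity: float = 98.0) -> Optional[StructureHit]:
    """Keep hits passing both identity criteria, then take the smallest Å value."""
    eligible = [h for h in hits
                if h.global_identity >= min_identity and h.site_identical]
    if not eligible:
        return None
    return min(eligible, key=lambda h: h.resolution)


def relative_accessibility(sa: float, max_sa: float) -> float:
    """Express a solvent accessibility (Å²) as a percentage of the maximum."""
    if max_sa <= 0:
        raise ValueError("max_sa must be positive")
    return 100.0 * sa / max_sa


# Made-up candidates with placeholder identifiers.
hits = [
    StructureHit("1AAA", 99.0, True, 2.8),
    StructureHit("2BBB", 98.5, True, 1.9),
    StructureHit("3CCC", 100.0, False, 1.2),  # fails the site identity rule
    StructureHit("4DDD", 95.0, True, 1.0),    # fails the global identity rule
]
chosen = pick_structure(hits)
percent = relative_accessibility(45.0, 250.0)  # illustrative numbers only
```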
Furthermore, we have also integrated protein surface accessibility data and links to structural data (when available) obtained from the Phospho3D database (39). For details on the structure, one can follow the link to PDBe (40) (http://www.ebi.ac.uk/pdbe).
The accessibility data, however, should be interpreted in the context of the structure. For example, a low accessibility value (of ~18%) is reported in the Phospho.ELM entry for the human Src (UniProt P12931) tyrosine 530, which is a well known substrate of the CSK kinase. This is because the structure used to calculate the accessibility has been determined in a closed Src conformation, where the phosphorylated tail binds to the SH2 domain. In general, when evaluating the SA of an instance, the user is advised to consider the molecular context of the instance. Note that in most cases, the best resolution structures are not in the phosphorylated conformation (i.e. with the phosphate moiety attached) or, as in the Src example, they are in a phosphorylated but closed, inactive conformation. In particular, if a site becomes available to its cognate kinase only as a consequence of a conformational change, this might be reflected in a (transiently) low accessibility value.
In the great majority of the cases (~97%), either an X-ray reference structure is not available for a Phospho.ELM sequence or a structure can be found but the phosphorylation site falls in an unresolved (disordered) region of the structure. In these cases, we provide the users with predicted accessibility values. The SA predictions were carried out using the real-SPINE integrated system of neural networks (41).
The accessibility score is shown as the last column in the HTML output (Figure 1) and, when provided, it is linked to the Phospho3D resource (http://arianna.bio.uniroma1.it/phospho3d). The user is encouraged to follow this link to gain more insight into the structural features of the particular protein, including a 3D representation of the instance as well as a comparison of all PDB entries available for that instance. The PDB entry that was used to calculate the score is listed in the second-to-last column of the HTML output and is linked to the PDBe resource at the EBI, where different viewers are available for closer inspection of the site.
In addition to providing more evidence about the structural properties of the region surrounding the phosphosites, we determine if the sites reside within domains annotated in the SMART resource (42). Furthermore, for each phosphosite, a probabilistic score was calculated ranging from 0 (complete order) to 1 (complete disorder) using the IUPRED intrinsic order–disorder predictor (43). The IUPRED algorithm uses the parameter ‘long’ and a window of 21 residues for smoothing the score. In the HTML output table, IUPRED scores below 0.5 (predicted ordered) are colored in grey while IUPRED scores above 0.5 (predicted disordered) are colored in black. Figure 4 shows the distributions of IUPRED scores for instances that reside either within (upper panel) or outside (lower panel) known SMART domains. Outside the known domains, the sites are strongly skewed to the native disorder values, reaffirming the earlier analyses (34). These curves may help in understanding the nature of cell regulation as they imply that most protein phosphorylation explicitly modulates protein–protein interactions in dynamic regulatory systems, rather than through allosteric regulation of the shape of the modified protein, although this is clearly an important function of the less abundant in-domain sites.
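The 0.5 cut-off used for coloring the IUPRED scores can be expressed as a small Python sketch. The function name is hypothetical, and the handling of a score of exactly 0.5, which the text does not specify, is an assumption (it is treated here as predicted ordered).

```python
from typing import Tuple


def iupred_display(score: float, threshold: float = 0.5) -> Tuple[str, str]:
    """Map a disorder score in [0, 1] to a (prediction, display color) pair."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score > threshold:
        return ("disordered", "black")
    # Assumption: a score exactly at the threshold is shown as ordered.
    return ("ordered", "grey")


low = iupred_display(0.12)
high = iupred_display(0.87)
```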
A few years ago it was estimated that more than 100000 phosphorylation sites might exist in the human proteome (44). Recently this number has been corrected upwards to more than 500000 sites (5). This newer value implies an average of approximately 25 sites ‘per protein’ yet it seems quite plausible, given the low LTP/HTP overlap in Figure 2.
Other bioinformatics resources also incorporate substantial phosphorylation data: more general ones are UniProt (26), HPRD (45), and PhosphoSitePlus (46). The latter provides mainly phosphorylation sites from Vertebrata, but also includes data from other PTMs such as acetylation, methylation, ubiquitination, and O-glycosylation. Both PHOSIDA (47) and phosphoPEP (48) are specialized in annotation of large-scale experiments; phosphoGRID for Saccharomyces cerevisiae (49), virPTM for viruses (50) and P3DB for various plants (51) are devoted to specific species. Additional information on phosphorylation resources can be found at the Phospho.ELM link page (http://phospho.embl.de/links.html) and at the GPS compendium of computational resources for protein phosphorylation (52) (http://gps.biocuckoo.org/links.php). In addition, a collection of phosphorylation databases and predictors has been recently published in a review by Via et al. (35). Though we intend to incorporate the most relevant large-scale analyses, we consider our main effort should be on the collection of manually curated phosphorylation sites and related information derived from small-scale experiments on various model organisms. In the near future, we plan to integrate more information on phosphorylation motifs and protein kinase specificity, such as the kinase docking motifs (53). In order to provide an up-to-date and comprehensive resource, we encourage our users to participate in the curation of the Phospho.ELM resource by submitting their own data (http://phospho.elm.eu.org/submit.html).
Novo Nordisk Foundation Center for Protein Research. Funding for open access charge: EMBL, Heidelberg, Germany.
Conflict of interest statement. None declared.
We would like to acknowledge all the Phospho.ELM users who, by reporting missing sites or sending us their data sets, have contributed to improving the database. Many thanks to Leonardo Briganti (MINT database) and Florian Gnad (PHOSIDA database) for technical support, and to Alessandro Barbato for running SA predictions. We are grateful to Norman Davey and Kim Van Roey for critical reading of the article.