Phosphorylation sites are often found in intrinsically disordered regions of proteins (34), which usually cannot be experimentally determined by X-ray crystallography. However, in a number of cases, they lie on globular domains whose sequence can confidently be mapped onto X-ray determined structures.
For the latter sites, accessibility to the solvent can be calculated. Currently, we have been able to assign an accessibility value to 3% of all the sites in Phospho.ELM (1281 of 42 574 instances), and we anticipate that this number will increase in parallel with the number of solved structures. These data are particularly relevant for bioinformaticians who develop computational methods to predict kinase substrates. Because of the transient nature of phosphorylation events, phosphorylation sites tend to lie on the surface of proteins. Many studies [see Via et al. for a summary] have shown that substrate specificity depends not only on the primary sequence of the motif hosting the phosphorylation site, but also on its structural conformation.
The correspondence between a Phospho.ELM sequence and an X-ray Protein Data Bank (PDB) structure (36) is based on a sequence alignment requiring at least 98% global sequence identity and 100% identity at the phosphorylation site. When more than one PDB structure corresponded to a single Phospho.ELM sequence, the one with the best resolution (lowest value in Å) was retained. Whenever a site can be mapped onto a PDB structure, its solvent accessibility (SA, in Å²) is taken from DSSP (37) and the corresponding percentage is obtained by normalizing the SA to the maximum accessibility value of the phospho-residue [as determined in (38)].
Furthermore, we have also integrated protein surface accessibility data and links to structural data (when available) obtained from the Phospho3D database (39). For details on the structure, one can follow the link to PDBe (40).
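The mapping criteria and the normalization described above can be sketched as follows. This is a minimal illustration, not the Phospho.ELM pipeline itself: the `Candidate` record, the function names and the candidate identifiers are hypothetical, and the maximum accessibility passed to `relative_accessibility` is a placeholder rather than a value taken from (38).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One alignment of a sequence to an X-ray structure (hypothetical record)."""
    pdb_id: str
    global_identity: float  # fraction in [0, 1]
    site_identical: bool    # residue at the phosphorylation site matches exactly
    resolution: float       # in Å; a smaller value means a better-resolved structure


def pick_structure(candidates: Sequence[Candidate],
                   min_identity: float = 0.98) -> Optional[Candidate]:
    """Apply the stated identity filters, then keep the best-resolved structure."""
    eligible = [c for c in candidates
                if c.global_identity >= min_identity and c.site_identical]
    return min(eligible, key=lambda c: c.resolution, default=None)


def relative_accessibility(sa: float, max_sa: float) -> float:
    """Express SA (Å²) as a percentage of the residue's maximum SA (Å²)."""
    if max_sa <= 0:
        raise ValueError("max_sa must be positive")
    return 100.0 * sa / max_sa


# Hypothetical candidates for a single sequence.
candidates = [
    Candidate("cand_a", 0.99, True, 2.4),
    Candidate("cand_b", 1.00, True, 1.8),
    Candidate("cand_c", 1.00, False, 1.2),  # rejected: site residue differs
    Candidate("cand_d", 0.95, True, 1.0),   # rejected: identity below 98%
]
best = pick_structure(candidates)
print(best.pdb_id)
# Placeholder numbers: SA of 30 Å² against an assumed maximum of 150 Å².
print(relative_accessibility(30.0, 150.0))
```

Filtering before ranking means that a very well-resolved structure is never preferred if it fails either identity criterion.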
The accessibility data, however, should be interpreted in the context of the structure. For example, a low accessibility value (~18%) is reported in the Phospho.ELM entry for tyrosine 530 of human Src (UniProt P12931), which is a well-known substrate of the CSK kinase. This is because the structure used to calculate the accessibility was determined in a closed Src conformation, where the phosphorylated tail binds to the SH2 domain. In general, when evaluating the SA of an instance, the user is advised to be aware of the molecular context of that instance. Note that in most cases, the best resolution structures are not in the phosphorylated conformation (i.e. with the phosphate moiety attached) or, as in the Src example, they are in a phosphorylated but closed, inactive conformation. In particular, if a site becomes available to its cognate kinase only as a consequence of a conformational change, this might be reflected in a (transiently) low accessibility value.
In the great majority of cases (~97%), either an X-ray reference structure is not available for a Phospho.ELM sequence, or a structure can be found but the phosphorylation site falls in an unresolved (disordered) region of the structure. In these cases, we provide the users with predicted accessibility values. The SA predictions were carried out using the real-SPINE integrated system of neural networks (41).
The accessibility score is shown as the last column in the HTML output and, when provided, it is linked to the Phospho3D resource (http://arianna.bio.uniroma1.it/phospho3d). The user is encouraged to follow this link to gain more insight into the structural features of the particular protein, including a 3D representation of the instance as well as a comparison of all PDB entries available for that instance. The PDB entry that was used to calculate the score is listed in the second-to-last column of the HTML output and is linked to the PDBe resource at the EBI, where different viewers are available for closer inspection of the site.
To provide further evidence about the structural properties of the region surrounding the phosphosites, we also determine whether the sites reside within domains annotated in the SMART resource (42). Furthermore, for each phosphosite, a probabilistic score ranging from 0 (complete order) to 1 (complete disorder) was calculated using the IUPRED intrinsic order–disorder predictor (43). IUPRED was run with the ‘long’ parameter and a window of 21 residues for smoothing the score. In the HTML output table, IUPRED scores below 0.5 (predicted ordered) are colored in grey, while IUPRED scores above 0.5 (predicted disordered) are colored in black. Figure 4 shows the distributions of IUPRED scores for instances that reside either within (upper panel) or outside (lower panel) known SMART domains. Outside the known domains, the sites are strongly skewed towards scores indicating native disorder, reaffirming the earlier analyses (34). These curves may help in understanding the nature of cell regulation, as they imply that most protein phosphorylation explicitly modulates protein–protein interactions in dynamic regulatory systems, rather than acting through allosteric regulation of the shape of the modified protein, although the latter is clearly an important function of the less abundant in-domain sites.
Figure 4. Histograms of IUPRED scores of Phospho.ELM instances within and outside of known domains. Instances with an IUPRED score above 0.5 are predicted to be in a region of polypeptide sequence that is intrinsically disordered (i.e. cannot fold into a stable …)
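The coloring rule and the within/outside-domain split behind the histograms can be illustrated with a short sketch. The function names, the number of bins and the example scores are assumptions made for illustration, and the text does not say how a score of exactly 0.5 is handled, so treating it as disordered here is also an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

DISORDER_CUTOFF = 0.5  # threshold quoted in the text


def disorder_label(score: float) -> Tuple[str, str]:
    """Return (prediction, display color) for one IUPRED score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("IUPRED scores are expected in [0, 1]")
    # Assumption: a score of exactly 0.5 is treated as disordered.
    if score >= DISORDER_CUTOFF:
        return ("disordered", "black")
    return ("ordered", "grey")


def score_histograms(sites: Iterable[Tuple[float, bool]],
                     n_bins: int = 10) -> Dict[str, List[int]]:
    """Bin scores separately for sites within and outside annotated domains."""
    counts = {"within": Counter(), "outside": Counter()}
    for score, in_domain in sites:
        # Clamp so that a score of 1.0 falls in the last bin.
        bin_index = min(int(score * n_bins), n_bins - 1)
        counts["within" if in_domain else "outside"][bin_index] += 1
    return {key: [c[i] for i in range(n_bins)] for key, c in counts.items()}


# Hypothetical (score, in_domain) pairs.
sites = [(0.05, True), (0.25, True), (0.75, False), (0.95, False), (1.0, False)]
hist = score_histograms(sites)
print(disorder_label(0.25), disorder_label(0.75))
print(hist["within"])
print(hist["outside"])
```

Keeping the two histograms in separate counters mirrors the two-panel layout of the figure and makes the skew of out-of-domain sites easy to inspect.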