In order to construct an improved predictor, we created a new database of P- and NP-sites by extracting annotated and unannotated S, T and Y sites from SWISS-PROT and combining them with the sites from PhosphoBase, the database used to train NetPhos (12
). We will use the term ‘site’ to indicate the phosphorylation site itself and the 24 flanking residues, 12 residues upstream and 12 residues downstream from the actual phosphorylation site. Our choice of the 25 amino acid window was based on experimental data for kinases that are known to contact 7–12 residues adjacent to the site of modification (38
First, we evaluated the accuracy of the NetPhos predictor using the P- and NP-sites derived from SWISS-PROT, excluding the ones that had >30% identity with the sites from PhosphoBase. The prediction accuracy of NetPhos (as defined in Materials and Methods) on this out-of-sample set of 107 serine, 36 threonine and 43 tyrosine positive sites combined with 500 negative sites for each residue, reached 69.1% for S (sensitivity 81.3 ± 3.7%, specificity 56.8 ± 2.2%), 71.9% for T (sensitivity 66.7 ± 7.8%, specificity 77.2 ± 1.9%) and 68.9% for Y (sensitivity 69.8 ± 7.0%, specificity 68.0 ± 2.1%). Note that the accuracy of NetPhos on PhosphoBase sites shown in the original paper (12
) was calculated under the assumption of 100% specificity, which is clearly not the case.
Next, we estimated the accuracy of Scansite (14
). Since the profiles for Scansite are not publicly available, we tested its accuracy using 100 positive and 100 negative examples for each S, T and Y residue randomly selected from our data sets. We used medium stringency for the evaluation as a reasonable balance between the sensitivity and specificity of Scansite. The prediction accuracy estimated for Scansite was 64.0% for S (sensitivity 38.0 ± 4.9%, specificity 90.0 ± 3.0%), 66.5% for T (sensitivity 43.0 ± 5.0%, specificity 90.0 ± 3.0%) and 68.5% for Y (sensitivity 49.0 ± 5.0%, specificity 88.0 ± 3.2%).
All subsequent experiments were performed on the combined data sets consisting of the sites from both SWISS-PROT and PhosphoBase (Table ). A detailed description of the data set construction is given in Materials and Methods.
Phosphorylation and non-phosphorylation sites used in the current study
Analysis of amino acid properties surrounding phosphorylation sites
We analyzed the properties of amino acids surrounding each site and determined which residues were enriched or depleted at specific positions. Figure shows the residues for which the observed differences in relative frequencies between P- and NP-sites are statistically significant (P
< 0.05). Statistical significance was calculated using Fisher’s permutation test (29
). For the S, T and Y sites we observed 164, 56 and 84 compositional differences, respectively, between P- and NP-sites with P
-values <0.05. A positive difference signifies that the P-sites are enriched in the corresponding residue, while a negative difference signifies depletion. Each residue was assigned a property: surface or buried (28
) (Fig. A), charged or neutral (Fig. B), hydrophobic or hydrophilic (39
) (Fig. C), high or low flexibility parameters (HFP and LFP, respectively) (27
) (Fig. D). The percentages of residues enriched and depleted in each category are shown in Table .
Figure 1 The amino acid residues significantly enriched and depleted around phosphorylation sites. Each residue is assigned a property: surface (red) or buried (black) according to Janin’s scale (28) (A), charged (black) or neutral (green) (B), hydrophobic (more ...)
The percentages of amino acid residues that are significantly (P < 0.05) enriched and depleted around phosphorylation sites
All three types of phosphorylation sites are clearly enriched in surface exposed residues and depleted in buried residues (Fig. A and Table ). Interestingly, the most prevalent amongst the surface residues surrounding serine are serine, lysine, arginine and glutamic acid, with glutamic acid found predominantly downstream from the phosphorylation site. The enrichment in serine agrees with the results of previous analysis (12
) and confirms the observation that serines tend to cluster. Another distinctive characteristic of the serine sites is the depletion of cysteine, leucine, isoleucine and the aromatic residues. The decreased frequency of leucine and isoleucine is also observed around T and Y sites. The abundance of aspartic and glutamic acids in close proximity to the Y site, and proline at the positions distant to Y, is a signature of tyrosine phosphorylation sites.
The analysis of the charged versus neutral residues surrounding phosphorylation sites suggests that all three sites are strongly depleted in the neutral residues (Fig. B and Table ), and S and T are slightly enriched in the charged ones. Among the residues that are under-represented around S, T and Y sites, neutral residues constitute 94, 93 and 87%, respectively (Table ). Interestingly, the reciprocal distribution of charged versus neutral amino acids does not hold for Y: this site is both enriched and depleted in the neutral residues. The high frequency of uncharged proline around Y sites might explain this result.
The partition of the residues into hydrophobic and hydrophilic classes shows that all three sites are depleted in the hydrophobics (Fig. C and Table ), with S (87%) and T (75%) being especially enriched in hydrophilics. The under-representation of hydrophobics is mainly due to the depletion of leucine, isoleucine and aromatic residues around all three sites.
We used a flexibility scale (27
) that is based on B-factor values to evaluate the distribution of amino acid residues with HFP and LFP (Fig. D and Table ). Following the definition of Vihinen et al
), alanine and threonine were considered to have HFP if flanked by residues with HFP, and they were considered to have LFP if surrounded by residues with LFP. The prevalence of HFP residues and the depletion of LFP residues around all sites are observed; 100, 93 and 84% of residues enriched around S, T and Y sites, respectively, belong to the HFP category (Table ). The prevalence of HFP residues around potential phosphorylation sites may facilitate accessibility of the site to the kinase.
The distribution of the disorder-promoting (R, K, E, P and S) and order-promoting (C, W, Y, I and V) residues (17
) around phosphorylation sites suggests that all sites have amino acid distributions characteristic for intrinsically disordered protein regions. The disorder-promoting amino acids are significantly enriched, and order-promoting amino acids are significantly depleted around all P-sites in our data sets. This observation indicates that the addition of the disorder feature may be important for phosphorylation site prediction.
In addition to the overall comparisons (Table ), we examined amino acids and positions that show the largest signals (or lowest P-values) in Fisher’s permutation test. The top 10 enrichments and the top 10 depletions of such residues are shown in Table . For serine, the P-values for the six most enriched and the two most depleted amino acids exhibit the greatest difference between P- and NP-sites with P < 0.0001, while for T the P-values for all depleted amino acids show smaller significant differences with P > 0.001. As for Y, the five most enriched amino acids also show the greatest differences between P- and NP-sites with P < 0.0001. Thus, the presence of particular amino acids at certain positions appears to be more important than the opposite, although the absence of particular amino acids is still statistically significant in all cases.
Table 3. The top 10 amino acids enriched and depleted around known phosphorylation sites as determined by Fisher’s permutation test (29)
Interestingly, several positions that are distant from the potential phosphorylation site also have significant differences between P and NP-sites, for example, position –11 for S, position –12 for T and position +9 for Y (Table ). This result suggests that although the recognition patterns for the currently known kinases usually involve only positions in close proximity to the potential phosphorylation site (usually within –5 to +5), the more distant positions may also be important. Thus, the kinase recognition patterns may be extended in the future as new kinases are discovered.
Sequence complexity of P-sites versus disordered protein regions
One feature that distinguishes disordered protein regions from ordered is sequence complexity K2
measured by Shannon’s entropy and applied to amino acid sequences by Wootton (23
). We discovered previously that the SWISS-PROT database is characterized by higher amounts of both low-complexity segments and predicted disordered regions in comparison to PDB (17
). Here we determined the sequence complexity K2
for the P- and NP-sites, and compared them to the K2
of ordered and disordered protein regions (Fig. ). The K2
distribution for the phosphorylation sites is very similar to the K2
distribution for the disordered segments, whereas the K2
for the NP-sites is very close to the K2
for ordered globular segments. Moreover, the differences in cumulative percentages for P- and NP-sites (Fig. , inset) show that from 1.4 to 5 times more P-sites have K2
values between 2.9 and 3.8, indicating the enrichment of P-sites in low complexity segments. The correlation of low K2
with phosphorylation is an interesting and important result of this analysis.
Figure 2 Sequence complexity distributions. Sequence complexity K2 was calculated for the sliding window of 45 residues. The data set ‘all disorder’ consisted of disordered regions characterized by X-ray diffraction (extracted from PDB-Select-25), (more ...)
Amino acid compositions of P- and NP-sites and disordered protein regions
Disordered and ordered protein regions differ in amino acid composition, with W, C, F, I, Y, V, L and N being significantly depleted and A, R, G, Q, S, P, E and K being significantly enriched in regions of disorder (17
). We compared amino acid compositions of P- and NP-sites and disordered protein regions (Fig. ). The results are presented as the difference between the composition of each data set and the composition of ordered globular protein regions: (Cdata set
Figure 3 Comparison of amino acid compositions between disordered protein regions, P- and NP-sites. The composition for each data set is shown in comparison with the ordered Globular-3D data set. The results are presented as the difference between the composition (more ...)
Overall, the amino acid compositions of P-sites and disordered regions are very similar. In particular, they are both significantly depleted in rigid, buried, neutral amino acids (W, C, F, I, Y, V and L), and are significantly enriched in flexible, surface-exposed serine, proline, glutamic acid and lysine (Fig. ). Interestingly, we observed >3-fold enrichment in serine for phosphorylation sites as compared with both non-phosphorylation sites and disordered protein regions. The extreme positive S peak for P-sites yet again suggests the tendency of serines to cluster and the possibility of sequential phosphorylation at multiple sites.
Predictor construction and accuracy estimation
We first used only position-specific amino acid frequencies (Materials and Methods) to construct a predictor for each site and to estimate its accuracy. The purpose of this experiment was to compare the accuracy achieved by DISPHOS with the accuracy of NetPhos (12
) when the same types of features over the same window size were used. The major differences between these two predictors originate from the data set construction, model choice, feature selection/extraction, and training and testing processes, as discussed below. We also used the initial predictor developed on the balanced training set to iteratively estimate the class priors in various groups of proteins. These estimates, in turn, enabled us to adjust the outputs of the initial predictor thus minimizing the total number of misclassified residues in each protein group.
We expanded the positive and especially negative data sets (Table ) by combining phosphorylation sites derived from two databases, PhosphoBase and SWISS-PROT. Three new data sets (samples) corresponding to S, T and Y sites, respectively, were then constructed (Materials and Methods). After the feature construction process, all data sets were high-dimensional and sparse (due to data representation), noisy (due to possible mislabeling of positive and especially negative sites), and highly imbalanced (with a large number of negative versus small number of positive sites). The ratios of positive versus negative sites were 1:18, 1:65 and 1:38 for S, T and Y, respectively. Additionally, only small sets of confirmed positive sites were available for T and Y after removing similar sequences (Table ).
Before model training we performed dimensionality reduction using a feature selection process (Materials and Methods) and principal component analysis (PCA). Then, following the approach of Radivojac et al
), we applied a bagging-like combination of logistic regression models (Fig. ) to construct a predictor. Therefore, we carefully approached the problem of class imbalance that usually requires special treatment since predictor performance may be highly degraded (42
). Neural-network based predictors did not surpass the accuracy of the bagged linear models.
The process of model building and testing.
When only significant position-specific features were used, the best accuracy for S (74.9 ± 0.3%) was achieved using a dimensionality of 40 and a P-value of 0.1 in the feature selection process (Table ). This accuracy exceeded that of NetPhos on the out-of-sample data set (69.2%) indicating significantly better performance of DISPHOS when the larger data set and the refined set of position-specific features were used. Similar results were observed for predictors using 30 dimensions and P-values of 0.1 for T and Y: 78.9 ± 0.6% versus 71.9% for T, and 81.3 ± 0.5% versus 60.6% for Y. Without ensembles of classifiers, i.e. for I = 1, the accuracy drops by two to three percentage points in each case.
Comparison of the accuracies (% ± 95% confidence intervals) between the NetPhos and DISPHOS
The addition of non-binary features resulted in a further increase in the accuracy of DISPHOS. Using the forward-selection algorithm (31
), we iteratively added new real-valued features to the sample until no increase in predictor accuracy was observed. Four rounds of feature selection were experimentally determined to be sufficient for each model. In every round only the feature that increased accuracy the most was chosen, and only the remaining features were used as candidates in the next round. The best features with their accuracies are summarized in Table .
As expected, the disorder feature increased the prediction accuracy for serine and threonine (Table ). In a few instances, the relative frequency of an amino acid was also selected as the best feature (Table ). Relative frequency features reflect the presence or absence of a particular amino acid anywhere in the window of 25, and not only at a specific position. Interestingly, none of the non-binary features from Figure were selected as the best for increasing predictor accuracy, most likely because they were already implicitly included in the predictor through disorder and position-specific features.
We report sensitivity and specificity of DISPHOS to be in the range from 76% for serine to 83% for tyrosine. However, the positive and negative examples were not labeled with the same confidence. Some of the negative examples were likely mislabeled due to the lack of experimental verification. This noise may have lead to underestimated predictor specificities. Using the approach described in Materials and Methods, we estimated the amount of noise in the negative data sets to be ~14% for serine, ~5% for threonine and ~8% for tyrosine. Subsequently, the correction formula provided us with more accurate specificities. We believe that the true specificities for the S, T, and Y predictors are ~85, ~84 and ~88%, respectively, pushing the overall prediction accuracy to 80% for serine, 82% for threonine and 86% for tyrosine. Similar corrections applied to NetPhos and Scansite would increase the accuracy of these methods as well. Hence, an immediate step towards improving the results of this study should be acquiring (or selecting) better quality data that will enable us to obtain (i) improved phosphorylation predictors and (ii) more accurate estimates of sn and sp, both leading to higher quality estimates of phosphorylation in nature.
Prediction of phosphorylation sites in protein functional categories from SWISS-PROT
We previously predicted disorder in various protein functional categories extracted from the SWISS-PROT database (3
). Here, we predicted phosphorylation sites in the same protein data sets using DISPHOS and a new estimation method (Materials and Methods). The estimated percentages of predicted S, T and Y phosphorylation sites for each data set and for disordered and ordered protein regions are shown in Figure .
Figure 5 Estimated percentages of S, T and Y phosphorylation sites in 12 functional protein categories from SWISS-PROT and in disordered and ordered data sets. The y-axis indicates the estimated percentage of phosphorylated S, T or Y residues in each data set. (more ...)
The most interesting result is that we predict over 10 times more phosphorylation sites in completely disordered proteins than in ordered proteins from PDB, which strongly supports the idea of a tight interconnection between protein phosphorylation and disorder. Moreover, the number of predicted phosphorylation sites in Figure correlates highly with the amount of predicted disorder (3
) in all protein categories.
Previously we showed that regulatory, cancer-associated and cytoskeletal proteins have about twice as much predicted disorder than proteins involved in degradation, biosynthesis and metabolism (3
). Here, we observed very similar but even more profound differences in the percentages of predicted serine phosphorylation sites for the same protein data sets: 57.4 ± 0.7, 42.2 ± 1.1 and 40.6 ± 0.9% for regulatory, cancer-associated and cytoskeletal proteins versus 10.8 ± 1.7, 8.7 ± 0.6 and 5.9 ± 1.1% for degradation, biosynthesis and metabolism data sets, respectively (Fig. ). These differences support the hypothesis that proteins involved in regulatory and signaling cellular functions undergo more frequent phosphorylation/dephosphorylation than proteins involved predominantly in catalysis.
Prediction of phosphorylation sites in various kingdoms and proteomes
It is well known that regulation in bacteria occurs via histidine or aspartic acid phosphorylation, involved in two-component signaling pathways (43
). However, the putative homologs of eukaryotic protein serine/threonine kinases and phosphatases, found in bacteria and archaea, suggest the possibility of alternative eukaryotic-like signal transduction pathways in these species (44
). Therefore, we were interested in comparing the DISPHOS predictions on viral, archaeal and bacterial proteins with predictions on proteins from seven complete or almost complete eukaryotic genomes (Table ).
Prediction of phosphorylation sites on proteins from SWISS-PROT and on seven entire proteomes
The percentages of predicted S, T and Y phosphorylation sites in each data set are shown in Table . We predict that on average from 18 (Caenorhabditis elegans) to 32% (Drosophila melanogaster) of all serine residues in eukaryotic proteomes may become phosphorylated, whereas this number ranges from only 2% in bacteria to 7% in viruses. Likewise, the percentage of predicted threonine phosphorylation sites in most eukaryotic proteomes (5–9%) exceeds those in bacteria (2%) and archaea (2%). We estimate that viruses and yeast have equal percentages (3%) of threonine phosphorylation sites. Interestingly, the estimated percentage of threonine phosphorylation sites in higher eukaryotes exceeds those in yeast, worm and plant proteomes. The percentage of tyrosine phosphorylation sites in all data sets is very similar and ranges from 4 to 5% in prokaryotes, viruses and archaea, and from 6 to 8% in eukaryotes (Table ).