Data collection, annotation and sequence redundancy reduction
We compiled all PTMs available at UniProt (The UniProt Consortium, 2010
), dbPTM (Lee et al, 2006
), PHOSIDA (Gnad et al, 2010
), PhosphoSite (Hornbeck et al, 2004
), HPRD (Keshava Prasad et al, 2009
), OglycBase (Gupta et al, 1999
) and PhosphoELM (Dinkel et al, 2010
). UniProt predictions labeled as inferred ‘by similarity' or ‘potential' were not included in the data set, nor in the analysis. For each modification, we recorded the protein id, type of modification, organism, amino acid and sequence position.
After collection, performed id and OG mapping and retrieved the protein sequences. We performed a pre-processing task to discard sequences that are identical or overlapping (e.g., peptides into protein sequences). The STRING database v8.3 (Jensen et al, 2009
) is used for the id mapping and sequence retrieval. Sequences with no exact match to the corresponding sequence in the OG were excluded from the analysis. From the resulting number of PTMs and species present in the filtered data set, we selected only those sets, PTM types and organisms, with enough data to perform an accurate statistical analysis (see next point).
The selected proteins were annotated to their cellular locations according to Uniprot cellular component keywords ontology. The protein function was inferred from their OGs (Tatusov et al, 1997
) as provided by the eggNOG database (Muller et al, 2010b
Selection of PTM types to include in the analysis
In the collection of PTMs we found a large set of PTM types with only a few residues annotated in the databases. To assess the lower limit of the number of residues that a PTM type should have to be able to be detected significantly co-evolving with any other PTM type, we performed a simulation to test the capacity of our statistical framework to detect global co-evolution between two PTM types.
We first calculated the average ratio of modified residues versus non-modified residues (background) measured in all comparisons (0.13) and the average ratio of pairs of residues with significant co-evolution versus the pairs not found co-evolving for both sets: modified residues and non-modifies residues (1.53 and 0.45, respectively). Using these proportions we simulated the results (number of co-evolving residues versus not co-evolving residues in modified and non-modified residues) that are the input for the Fisher test that calculates whether a pair of PTM types is globally co-evolving compared with comparable non-modified residues. This type of analysis was performed for 50 simulated PTM types (each of them increases its number of residues in one, from 1 to 50 residues). We introduced two variables in these simulations: (i) the number of residues with other PTM type that are present in the same protein of at least one residue of the simulated PTM type (overlapping residues), we performed simulations for 1 to 8 overlapping residues and (ii) the number of not co-evolving modified residues of the simulated PTM type (from 0 to the number of residues overlapping).
Per each of the rounds of simulations (50 simulated PTM types with x
number of overlapping residues and y
number of not co-evolving residues), we generated the input for the Fisher test and added to the pipeline that gets adjusted P
-values (by false discovery rate (FDR)) including all the results for the comparisons of the larger PTM types (13 in total) to simulate a real scenario in the adjustment of the P
-values. See results in Supplementary Figure 1
Evaluation of the number of PTMs and unique PTM types per protein
We built the expected distributions of the number of proteins with a particular number of PTMs and PTM types by randomly assigning the modifications we have in our data set to the proteins in each of the species in the study. We used the non-parametric Kolmogorov–Smirnov test to compare the expected distributions with the distributions extracted from our data set. We used Fisher exact test to evaluate whether the number of proteins with more than one PTM type in our data set is more than expected. See results in Supplementary Figure 3
Functional enrichment analysis of protein with a specific PTM type
We analyzed the enrichment of protein functions and cellular location as well as type of protein region where the modified residues are located (i.e., ordered or disordered regions) for each set of proteins with a particular PTM type. The functional classifications for the proteins were obtained from metazoa OGs (meNOGs) from the eggNOG database, as a consensus between coverage and wealth of annotation. The reference set for the functional enrichment was the whole set meNOG functional classification. As mentioned before, we used Uniprot annotation for deriving protein location. We used the whole set of proteins from Uniprot from the 8 selected species from which we study the PTMs as the reference set for the location enrichment analysis. We also annotated the protein regions (ordered or disordered regions) where the modifications are placed and performed an enrichment analysis of any of these two categories in the global distribution of ordered and disordered region of the proteins belonging to the species under study. The DisEMBL algorithm v.1.4 (Linding et al, 2003
) was used for the detection of disordered regions within proteins using COIL definition for the disordered regions.
For all the comparisons, we applied a Fisher exact test with P-values adjusted by FDR for the whole set of tests. Adjusted P-values below 0.05 were taken as significant.
We built a phylogenetic species tree based on the NCBI taxonomy, which is known to be accurate for most taxa (Benson et al, 2010
; Sayers et al, 2011
), and inferred branch lengths. To assess the branch lengths, we generated alignments of 40 ubiquitous, single copy marker genes (Ciccarelli et al, 2006
) for 853 different species using AQUA (Muller et al, 2010a
) and combined the tree topology of the NCBI taxonomy tree with them using PhyML (Guindon et al, 2010
). The resulting tree was manually curated and genomes that had an erroneous placement in the NCBI taxonomy tree were removed. The final tree includes 851 taxa including 35 eukaryotes, 43 archaea and 773 bacteria. Of the eukaryotic taxa included in the tree, 17 were metazoan (8 mammalia; 3 primates) and 13 were fungi.
Mapping proteins to OGs and tree generation pipeline
The database eggNOG v2.0 was used to map every protein in the data set to the oldest eukaryotic OG in which the protein is present and generated a multiple sequence alignment (MSA) and sequence tree using PhyML v.3.30. TreeKO (Marcet-Houben and Gabaldón, 2011
) algorithm was used to root the tree and decompose the sequence tree into a set of all possible pruned trees with no duplications and in consonance with the species tree topology. The set of pruned trees includes all combinations of splitting the duplication events so that there are no trees with paralogous sequences but at least one tree for each paralog. The branch lengths for this tree are derived from the species tree. For all trees in this set that include the reference sequence, the residue RCS (see next section) is calculated for the modified residue. The tree in which the modified amino acid presents a higher score is selected for the evaluation.
The RCS for amino acid aa (RCSaa) evaluates the conservation of a particular amino acid (aa) within its position in the MSA of the corresponding OG. As explained in previous section, the resulting tree generated from the OG has been reconciled and has branch lengths information based on the species tree. The RCSaa is composed by two components, the RCR, which represents the occurrence of the amino acid in the exact position in the sequences present in the pruned tree generated from the MSA and the MBL, which is the maximum branch distance between any pair of the species represented in the tree with the modified residue present in the aligned position.
To avoid that the size of the OG affects RCS, we get the oldest common ancestor of any two pairs of sequences with the amino acid conserved and only the descendants of this common ancestor are taking in consideration for the calculation. Thus,
is the number of times aa appears in that position.
Reference distribution calculation for the normalization of the RCS
As the overall conservation status of the protein can indeed be a bias for the measurement of the conservation of the modified site, we generated a specific reference distribution of conservation of non-modified residues for every modification in the data set in order to normalize its RCS. To build the reference distribution for a particular modified site, we calculate the RCS for all non-modified residues from the OG that are of the same type of amino acid as the one with the PTM and it is placed in the same class of protein region (ordered or disordered). The RCS of the modified site is then mapped into the reference distribution to calculate the percentile of its value, this percentile is what we name relative RCS (rRCS). Only those modifications with >10 values in the reference distribution were selected for the conservation analysis.
Comparison of conservation distributions
We used the non-parametric Kolmogorov–Smirnov test to calculate the statistical significance of two distributions of RCSs or rRCSs.
In order to obtain a ranking of the overall conservation status of PTM types and function/location specific sets of PTMs, we calculate the mean of all rRCSs for every of these sets.
Extraction of co-evolving PTM pairs
We used the MI algorithm (Cover and Thomas, 1991
) to evaluate the co-evolution of two PTMs in the same protein. The MI of two variables (Y
) is measured as
) is the joint probability of X
, and p1
) and p2
) are the probabilities of X
, respectively. As MI measures the dependency of two variables, the PTM pairs modifying exactly the same residue within the protein were excluded from the analysis. We also excluded proteins which OG have <7 species and modified sites who are not conserved in at least 4 species and at most in the total number of species in the OG minus 3. MI values representing anti-correlation were converted to negative values to allow this measurement to distinguish between cases were co-evolution is real (common co-occurrences of sites in the same species) and cases were the presence of the site in the species set is complementary (high MI value but low co-evolution), see .
In every protein we built a set of different PTM pairs with all the modified residues present in the sequence. In order to avoid the effect of the possible co-evolution of PTMs of the same type increasing the global co-evolution measurement of a pair of PTMs due to multiple cross-evaluation, for a pair of PTM types, we only allow a particular modified residue to be present in a single pair of PTMs, the selection is done randomly. We then calculate the MI for the set of pairs selected and per each MI calculated we permute the species labels of one of the two residues 100 times and calculate again its MI. The permutations generate a reference distribution where the MI can be mapped in with a non-parametric approach. If the percentile of the modified residue is above 95 we assume the residues are co-evolving to a significant degree. The same approach is applied to a set of random pairs of non-modified residues selected under same conditions, same amino acids and same protein region distributions.
To evaluate the global co-evolution of two types of PTMs (as well as self co-evolution), we performed a Fisher exact test to see whether the ratio of the modified residues above 95 percentile on their reference distributions is significantly higher than the same ratio in non-modified residues. As the selection of the random pairs of residues can affect the result of the Fisher test, we repeated the whole process 100 times. P-values were corrected for FDR and value <0.05 taken as significant. The number of times we found that a pair of PTM types is significantly associated (maximum 100 and minimum 1) is then used to evaluate the coverage of the co-evolving pairs of PTMs within the whole set of pairs with the same type of PTMs.
Comparison of conservation between co-evolving and non-co-evolving modified residues
We compared by a Fisher exact test the number of times the rRCS of the modified and co-evolving residues is greater and lower than 95% against the same parameters for the modified but no co-evolving residues.
Preferred functionality for a set of proteins
To extract the preferred functions of the set of proteins with at least one pair of co-evolving modified residues, we used the annotation provided by the proteins OGs in the metazoa level (meNOGs) from the eggNOG database. A functionality is classified as preferred when the number of proteins annotated with such term is above the number of proteins annotated to any of the cellular locations present in the set as expected by chance plus 10 percent.
Preferred cellular location for a set of proteins
To extract the preferred location of the set of proteins with at least one pair of co-evolving modified residues, we used the cellular component annotation extracted from Uniprot (see above). A cellular location is classified as preferred when the number of proteins annotated with such term is above the number of proteins annotated to any of the cellular locations present in the set as expected by chance plus 10 percent.
Gene Ontology enrichment analysis
We performed a Gene Ontology terms enrichment analysis to every set of proteins with a particular pair of co-evolving PTM types using the FatiGO (Al-Shahrour et al, 2004
) software available at the Babelomics suite (Medina et al, 2010
). We first filtered the sets of proteins with co-evolving pairs of PTM types to include only human proteins (by far the most abundant in the data set). We did the same filtering to proteins sets with the same type of modifications but found not co-evolving and we used them as respective background lists. We evaluated the enrichment of Gene Ontology terms in biological processes, molecular function and cellular component categories restricted to a propagated annotation in level 5 (as a good compromise between the detail and the amount of the annotation retrieved) and using a two-tailed Fisher exact test for the enrichment analysis. P
-values are adjusted by FDR and significant values are taken as <0.05.
Enrichment analysis in protein–protein interaction networks
We used the software SNOW (Minguez et al, 2009
) available at Babelomics suite (Medina et al, 2010
) to evaluate the topological parameters of the protein–protein interaction networks formed by the set of proteins with a particular type of co-evolving pair of PTM types. As in the GO enrichment analysis previously described, we filtered the proteins sets to include only human proteins. We generated two types of background so the analysis was performed twice (all results shown), first as in GO enrichment analysis we generated the set of human proteins with same type of modifications but not co-evolving and the second type of background was offered by the SNOW software as a set of 500 random sets of proteins (with the same number of proteins as the query set). The minimal connection network (MCN) in both comparisons was generated allowing a single external protein to link proteins in the set and taking protein–protein interactions from the curated set SNOW provides. P
-values are adjusted by FDR and significant values are taken as <0.05.
Analysis of the proximity of the co-evolving PTMs
We measured the distance between PTM sites as the number of intervening residues and in the available protein structures, in Ångströms, of every pair of co-evolving amino acids for each pair of PTM types. To have a reference distribution for each pair of PTMs, we randomly selected the same number of amino acids with the same PTM types, but found not co-evolving, from the same proteins and measure their distance in the same way. Both, the distribution of distances in the modified residues and the reference are then compared applying Kolmogorov–Smirnov test. P-values are adjusted by FDR and significant values are taken as <0.05. As the samples are small and the random sets can alter the final results, we performed the analysis 100 times and at the end count the number of times the difference was found significant.
Enrichment analysis of the interaction between PTMs and protein domains
For every pair of PTM types with certain degree of co-evolution (red lines at ), we extracted the Interpro domains (Hunter et al, 2009
) with one of the modifications inside. We used Interpro domains instead of more curated databases, such as SMART (Letunic et al, 2012
), to be able to include a wider definition of globular domains. We also measured the number of times that a particular modification was found inside the same domains when it is present in isolation or co-evolved with any other PTM type. We performed an enrichment analysis using Fisher exact test for each domain to test the difference between these two ratios. P
-values were adjusted by FDR for the whole set of analysis and values <0.05 were taken as significant.
Extracting short-linear motifs enriched in sets of proteins with co-evolving PTM pairs
We performed an enrichment analysis of short-linear motifs, of 3–10 residues, for every set of proteins with a particular pair of co-evolving PTMs using the SliMFinder (Davey et al, 2010
) software. Default parameters were used for this analysis. The extracted linear motifs were compared using CompariMotif (Davey et al, 2007
) software to known motifs belonging to all databases available in SlimFinder.
Identification of functions enriched in proteins with co-evolving methylated and phosphorylated residues and DLF motif
We performed a functional enrichment analysis with the proteins having crosstalking methylated and phosphorylated residues and the DLF short-linear motif using the FatiGO software (Al-Shahrour et al, 2004
) available within the Babelomics suit of tools (Medina et al, 2010
). We used Gene Ontology terms as protein annotation and chose an adjusted (by FDR) P
-value of 0.05 as the threshold for significance.