Understanding and predicting the molecular cause of disease is one of the major challenges for biology and medicine. One particular area of interest continues to be computational analyses of disease-associated amino acid substitutions. To this end, various studies have been performed to identify molecular functions disrupted by disease-causing mutations. Here, we investigate the influence of disease-associated mutations on post-translational modifications. In particular, we study the loss of modification target sites as a consequence of disease mutation. We find that about 5% of disease-associated mutations may affect known modification sites, either partially (4%) or fully (1%), compared to about 2% of putatively neutral polymorphisms. Most of the fifteen post-translational modification types analyzed were found to be disrupted at levels higher than expected by chance. Molecular functions and physicochemical properties at sites of disease mutation were also compared to those of neutral polymorphisms involved in the process of post-translational modification site disruption. Disease-associated mutations in the neighborhood of post-translationally modified sites were found to be enriched in mutations that change polarity, charge, and hydrophobicity of the wild-type amino acids. Overall, these results further suggest that disruption of modification sites is an important but not the major cause of human genetic disease.
Protein post-translational modifications are reversible or irreversible chemical alterations of a protein after its translation. They include covalent additions of particular chemical groups (e.g. phosphoryl), lipids (e.g. palmitic acid), carbohydrates (e.g. glucose) or even entire proteins (e.g. ubiquitin) to amino acid side chains, as well as the enzymatic cleavage of peptide bonds. With some exceptions (e.g. hydroxylation), protein post-translational modifications occur at side chains that can act as either strong (C, M, S, T, Y, K, H, R, D, E) or weak (N, Q) nucleophiles, while the remaining residues (P, G, L, I, V, A, W, F) are rarely involved in covalent modifications of their side chains. Post-translational modifications frequently affect protein function via changes in the protein structure and dynamics. Alternatively, a modified residue may be a part of a binding region directly recognized by a partner. For example, phosphotyrosines are known to be directly targeted by the SH2 domains and acetyllysines are targeted by bromodomains. Similarly, proteolytic cleavage is typically a part of degradation pathways. Biologically, post-translational modifications are involved in a number of activities such as regulation of gene expression, activation/deactivation of enzymatic activity, protein stability or destruction, and mediation of protein-protein interactions. Whatever the molecular context, the major role of post-translational modifications is to enable signaling and regulatory mechanisms that modulate a protein’s cellular function.
There are more than 200 documented types of post-translational modifications, many of which were discovered only recently. More interestingly, a large fraction of them are catalyzed by modifying enzymes. It is estimated that about 5% of the genes in Homo sapiens code for modifying enzymes. There are 518 kinases in the human genome and more than 150 phosphatases. Similarly, the human genome also codes for around 600 E3 ubiquitinating ligases and 80 deubiquitinases. These modifying enzymes are ubiquitous in all kingdoms of life, especially in eukaryotes. For example, there are 1019 kinase- and 300 phosphatase-coding genes in Arabidopsis thaliana and even the yeast genome codes for 119 kinases. However, despite the increasing recognition of their importance, the commonness and full functional repertoire of post-translational modifications are still unknown. The focus of this study is on the phenotypic effects of the disruption of post-translationally modified sites by single amino acid substitutions.
There are a number of cases in which mutations of the post-translational target sites were found to be directly involved in disease. One example is a loss of N-linked glycosylation in the prion protein (PRNP), where amino acid substitution T183A was shown to be involved in autosomal dominant spongiform encephalopathy. This particular variant causes numerous clinical symptoms such as early-onset dementia, cerebral atrophy, and hypometabolism. Interestingly, a wild-type form of PRNP was also found to be protease-resistant in the presence of the mutant. N-linked glycosylation occurs on asparagine residues in NX[ST] motifs; thus, the loss of the threonine in the consensus sequence prevents the attachment of a carbohydrate. Modifications of the NX[ST] motif have previously been implicated in intracellular accumulation of PRNP in vitro. Another example is a loss of acetylation sites in androgen receptor (AR). Loss of AR acetylation has been implicated in Kennedy’s disease, an inherited neurodegenerative disorder. Here, the amino acid substitution K630A, or the pair K632A and K633A, has been shown to cause a significant slowdown of ligand-dependent nuclear translocation. Furthermore, the non-acetylated mutants misfold and form aggregates with several other proteins, including ubiquitin ligase E3, thus affecting proteasomal degradation. Yet another example involves serine phosphorylation in the period circadian protein homolog 2 (PER2). Mutation of S662 is associated with the familial advanced sleep phase syndrome, an autosomal dominant disorder with early sleep onset (around 7:30pm) and early awakening (around 4:30am), but normal sleep duration. Biochemical studies have shown that phosphorylation of S662 affects phosphorylation (by casein kinase CKIε) of several other residues in PER2, so that loss of S662 phosphorylation results in an overall hypophosphorylation of PER2. Interestingly, creation of a negative charge by S662D or an excess of CKIε restores the phosphorylation patterns of PER2.
The current working hypothesis regarding PER2 is that phosphorylation of S662 likely creates a recognition site for CKIε and triggers a cascade of downstream effects. However, functional roles of phosphorylated PER2 are still largely unknown.
In addition to the individual examples, systematic studies implicating post-translational modifications in disease are now facilitated by the rapid growth of databases containing disease-associated mutations, human polymorphisms, and also post-translational modifications. One of the first such studies was carried out by Wang and Moult, who analyzed protein structures and concluded that a large majority of human inherited disease mutations affect protein stability. Only a small percentage of amino acid substitutions were estimated to affect post-translational modifications and binding sites in general; however, only N-linked glycosylation was investigated. In addition, Wang and Moult studied only protein structures, whereas several types of post-translational modifications were shown to occur preferentially in the disordered protein regions [13–15]. Vogt et al. looked into the gain of N-linked glycosylation sites and their involvement in disease, predicting that a number of disease-associated mutations introduce changes in glycosylation patterns by creating NX[ST] motifs [16, 17]. Lee et al. and Yang et al. matched experimentally determined modification sites with amino acid substitutions from different databases and found 47 and 64 substitutions, respectively, to affect post-translational modifications. In our previous work, we studied confidently predicted phosphorylation sites affected by somatic mutations and found that both gain and loss of phosphorylation target sites may be an active mechanism in cancer. This study was recently extended to include confident predictions of methylation, ubiquitination, and O-linked glycosylation, implicating all three modifications in disease [15, 21, 22].
In this study, we adopt a simple strategy and analyze a larger number of post-translational modifications in the context of disease-associated and putatively neutral amino acid substitutions. The experimentally verified sites of post-translational modifications were searched against the amino acid substitution databases with the goal of investigating whether and in what ways post-translational modifications are affected by inherited and somatic disease mutations. We found that disease-associated mutations are enriched both at directly disrupted modification sites and in their close proximity. In contrast, the putatively neutral polymorphisms occur less frequently in the neighborhoods of the modification sites. Furthermore, we found that the sites of post-translational modifications were enriched in amino acid substitutions that change physicochemical properties of the wild-type amino acids.
The data sets of post-translational modifications were collected from several public databases and the literature. We mined Swiss-Prot, Human Protein Reference Database (HPRD), phosphoELM, Protein Data Bank (PDB), O-GlycBase, PhosphoSite, and PhosphoPOINT. Only modification types containing 50 or more instances were of interest, resulting in 15 different post-translational modifications from a number of different species. In total, these data sets contained 78,975 unique sites (Table 1).
The data set of the inherited amino acid substitutions in humans (Disease-I) was assembled from the Human Gene Mutation Database (HGMD) and Swiss-Prot. The data set of somatic mutations in cancer (Disease-S) was also collected from Swiss-Prot and several recent cancer gene resequencing projects reviewed by Lee et al. The sites already present in the Disease-I data set were removed from Disease-S. Finally, the putatively neutral polymorphisms (Neutral) were downloaded from the Swiss-Prot database. All polymorphisms found in the disease sets were removed. We assumed that only a small fraction of neutral polymorphisms may be involved in disease, that is, that the large majority of them are either neutral or have minor phenotypic effects. The data sets of amino acid substitutions are summarized in Table 2. In total, the set contained 73,463 amino acid substitutions from 12,987 proteins.
In order to investigate the relationships between post-translational modifications and amino acid substitutions, different scenarios were considered. First, a set of human post-translational modifications was created by: (1) including only those sites that were experimentally identified in human proteins, and (2) mapping 25-residue long fragments from any other species (modification site ±12 amino acids around it) to the human proteins such that all 25 residues were identical to the corresponding residues in the human protein. Clearly, in the latter case, the correctness of such modification sites is not guaranteed; however, an exact 25-residue fragment match is expected to be a strong indication of functional similarity. This is often true for the modifications where only local interaction exists with the modifying enzyme (e.g. kinases); however, in some other cases with long-range interactions (e.g. E3 ligase binding in ubiquitination) the assumption may be less likely to hold. The fragment length of 25 was chosen based on the phosphorylation data for which there is evidence of physical kinase-substrate binding within about 7–12 residues of the modification site. We refer to the experimentally verified human modification sites as true sites, while the ones obtained by the exact fragment matches are referred to as the homology sites.
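The exact-fragment transfer described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the 0-based positions, and the decision to skip sites whose ±12 window runs past a protein terminus are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

FLANK = 12  # residues on each side of the site, giving a 25-residue window


def window(seq: str, pos: int, flank: int = FLANK) -> str:
    """Return the window centred on 0-based `pos`, or '' if it would run
    past either end of the sequence (assumption: such sites are skipped)."""
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(seq):
        return ""
    return seq[start:end]


def transfer_sites(
    source_seq: str,
    source_sites: List[int],
    human_proteins: Dict[str, str],
    flank: int = FLANK,
) -> List[Tuple[str, int]]:
    """Map modification sites from a non-human protein onto human proteins,
    requiring every residue of the window to be identical."""
    hits: List[Tuple[str, int]] = []
    for pos in source_sites:
        frag = window(source_seq, pos, flank)
        if not frag:
            continue
        for name, human_seq in human_proteins.items():
            start = human_seq.find(frag)
            while start != -1:
                # The modified residue sits in the middle of the fragment.
                hits.append((name, start + flank))
                start = human_seq.find(frag, start + 1)
    return hits
```

Calling `transfer_sites(seq, [12], {"H1": human_seq})` returns one `(protein, position)` pair for every exact occurrence of the 25-residue fragment; a single mismatch anywhere in the window yields no homology site.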
Two types of matching between amino acid substitutions and post-translational modifications were considered: (1) matches where the substitution occurred at a modification site and (2) matches where the substitution site was in the neighborhood of the modification site (i.e. between residues −3 and +3). This matching was based on an assumption that a mutation can affect the post-translational modification if it is in the vicinity of the target residue. One such situation occurs with mutation R16C in human PTP synthase, which diminishes phosphorylation of S19 and causes hyperphenylalaninemia [32–34]. The situation where a substitution site and the modification site are at the same position is referred to as the direct match. A substitution site that is no more than 3 residues away from the modification site is referred to as the neighborhood match. An example of a neighborhood match to a homology site is shown in Figure 1.
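The two match types can be expressed in a few lines. The sketch below assumes that positions are residue numbers within the same protein; the function name and return labels are illustrative only.

```python
from typing import Iterable, Optional


def classify_match(
    sub_pos: int, site_positions: Iterable[int], radius: int = 3
) -> Optional[str]:
    """Classify a substitution relative to known modification sites.

    Returns "direct" if the substitution falls on a modification site,
    "neighborhood" if it lies within `radius` residues of one, else None.
    A direct match takes precedence over a neighborhood match.
    """
    result: Optional[str] = None
    for site in site_positions:
        distance = abs(sub_pos - site)
        if distance == 0:
            return "direct"
        if distance <= radius:
            result = "neighborhood"
    return result
```

For the PTP synthase example in the text, `classify_match(16, [19])` gives a neighborhood match because R16 lies three residues from the phosphorylated S19.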
With all the matched sites of post-translational modifications and amino acid substitutions, two strategies were adopted to estimate statistical confidence of the observed trends. First, we used the t-test and the binomial test to estimate whether a certain group of amino acid substitutions (Neutral, Disease-I, Disease-S) is enriched or depleted in a particular modification. Second, the hypergeometric test was used to estimate enrichment and depletion in functional terms for each of the categories. The amino acid property changes were studied for the mutations in the neighborhood of modification sites (±3 residues). Three properties were investigated: side chain polarity (polar, non-polar), charge (positive, negative, neutral), and hydrophobicity (hydrophobic, hydrophilic), and the significance of those results was estimated using the t-test.
Positional conservation was calculated using the commonly used conservation index AL2CO. First, all 12,987 human proteins with mutations were searched against GenBank for the 500 best hits. These sequences were subsequently aligned using the ClustalW program. Then, the positional entropy by Henikoff and Henikoff was calculated as the conservation index for all modification sites. The conservation index value was normalized to the 0–1 interval; the higher the value, the more conserved the position.
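To make the idea of an entropy-based conservation score concrete, the sketch below computes an unweighted Shannon entropy for one alignment column and rescales it so that 1 means fully conserved. It is a simplified stand-in, not AL2CO and not the Henikoff and Henikoff weighting scheme; ignoring gap characters and normalizing by log2(20) are assumptions of this example.

```python
import math
from collections import Counter


def conservation(column: str, alphabet_size: int = 20) -> float:
    """Score one alignment column on a 0-1 scale (1 = fully conserved).

    Non-letter characters such as '-' are ignored; a column with no
    residues scores 0.0 (assumption made for this sketch).
    """
    residues = [c for c in column.upper() if c.isalpha()]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)
```

A column such as `"SSSS"` receives the maximum score, while mixed columns such as `"SSST"` or `"ST"` score progressively lower.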
Using the two scenarios for obtaining post-translational modification sites (true and homologous sites) and using two strategies of matching them to the amino acid substitutions (direct and neighborhood matches), we analyzed the trends of amino acid substitutions in inherited disease, somatic disease, and neutral polymorphisms with respect to post-translational modifications.
The percentage of all amino acid substitutions that lie directly on or in the neighborhoods of modification sites in the Disease-I, Disease-S, and Neutral data sets was investigated first. We found that direct and neighborhood mutations were in the vicinity of true and homology modification sites in 4.5% of cases in Disease-I, 3.1% of cases in Disease-S, and 2.1% of cases in the Neutral data set (Figure 2). When only unique substitution sites were considered, these frequencies were 3.9%, 3.3%, and 2.1%, respectively (Figure 2). The most significant differences between the sets of inherited disease and neutral substitutions were detected in the cases of N-linked glycosylation (233 out of 306 in Disease-I; P = 1.6e−19), carboxylation (62/63; P = 2.4e−17), hydroxylation (68/72; P = 8.6e−16), acetylation (75/91; P = 4.9e−10), proteolytic cleavage (84/120; P = 1.9e−5), and O-linked glycosylation (32/41; P = 3.5e−4). Thus, the disease-associated mutations are more likely to affect post-translational modifications than the neutral substitutions (P = 5.6e−4).
Figure 3 shows the trends of enrichment and depletion of unique amino acid substitution sites that directly match experimentally verified (i.e. true) modification sites. The trend T was calculated as
where fobs and fexp ≠ 0 are the observed and expected rates (relative frequencies) of substitutions that match modification sites, respectively. The trend is positive if fobs > fexp and negative if fexp > fobs. T = 1 is the maximum value and involves a hypothetical situation with fexp = 0 and fobs ≠ 0; T = −1 is the minimum value and indicates that fobs = 0. Since inherited disease, somatic disease, and neutral polymorphisms contain 51%, 5%, and 44% of all substitution sites used in this study, fexp was set to 0.51, 0.05, and 0.44 for the three data sets, respectively. The ratio of the three groups of amino acid substitutions also determines the null hypothesis that was used to calculate statistical significance of the observed enrichment or depletion. Trends similar to those observed in Figure 3 were present when the matching process was extended to homologous modification sites and neighborhood matches (Figure 4).
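The properties listed above (positive when fobs > fexp, a maximum of 1 when fexp = 0, a minimum of −1 when fobs = 0) are satisfied by the normalized difference (fobs − fexp) / (fobs + fexp). The sketch below uses that form purely as an assumption; other normalizations, for example dividing by the larger of the two rates, meet the same constraints, so this should not be read as the exact expression used in the study.

```python
def trend(f_obs: float, f_exp: float) -> float:
    """Illustrative trend score in [-1, 1].

    ASSUMPTION: uses (f_obs - f_exp) / (f_obs + f_exp), one of several
    forms consistent with the properties described in the text.
    """
    if f_obs < 0 or f_exp < 0:
        raise ValueError("rates must be non-negative")
    if f_obs == 0 and f_exp == 0:
        raise ValueError("trend is undefined when both rates are zero")
    return (f_obs - f_exp) / (f_obs + f_exp)
```

As a usage example, if 233 of 306 matched substitutions fall in the inherited disease set, `trend(233 / 306, 0.51)` is positive, indicating enrichment relative to the 51% share expected by chance.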
Table 3 shows the observed numbers of amino acid substitutions matching post-translationally modified sites. The P-value for the positive trends T was calculated as
$$P = \sum_{i=k}^{K} \binom{K}{i} f_{\mathrm{exp}}^{\,i} \left(1 - f_{\mathrm{exp}}\right)^{K-i}$$
where K is the total number of matches to any of the three substitution sets, k is the observed number of matches in a particular data set, and fexp is the expected rate for that data set. The P-value for the negative trends was calculated by replacing k by 0 and K by k in the limits of the summation operator.
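The description above, a sum running from k to K for positive trends and from 0 to k for negative trends, corresponds to a one-sided binomial tail with success probability fexp. A minimal sketch using only the standard library follows; the function names are illustrative.

```python
from math import comb


def binom_pmf(i: int, n: int, p: float) -> float:
    """Probability of exactly i successes in n trials with success rate p."""
    return comb(n, i) * p**i * (1.0 - p) ** (n - i)


def p_enrichment(k: int, K: int, f_exp: float) -> float:
    """Upper tail: probability of k or more matches out of K by chance."""
    return sum(binom_pmf(i, K, f_exp) for i in range(k, K + 1))


def p_depletion(k: int, K: int, f_exp: float) -> float:
    """Lower tail: probability of k or fewer matches out of K by chance."""
    return sum(binom_pmf(i, K, f_exp) for i in range(0, k + 1))
```

With the rounded expected rate of 0.51, `p_enrichment(62, 63, 0.51)` evaluates to roughly 2e−17, in line with the value reported for carboxylation in the inherited disease set.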
Figure 3, Figure 4, and Table 3 indicate that the substitutions associated with inherited disease affect the sites of post-translational modifications with frequencies higher than expected by chance. In contrast, putatively neutral polymorphisms affect modification sites with lower-than-expected frequencies.
It is well established that disease-associated mutation sites are more conserved than human polymorphic sites. Here, we analyzed the conservation of the post-translationally modified sites for which there are known amino acid substitutions either at the modification site itself or in its neighborhood. Not surprisingly, we find that the modification sites directly matching disease-associated substitution sites are more conserved than those matching neutral polymorphic sites (Figure 5A). However, post-translational modifications lying in the neighborhood (±3 residues) of inherited disease mutations are also more conserved than the modification sites corresponding to the neutral polymorphisms, with a similar margin. In contrast, the conservation of somatic mutations is significantly lower when the neighborhoods of modification sites are considered.
To further study the impact of amino acid substitutions that occur in the vicinity of the modification sites, we analyzed gene functions of all types of post-translational modifications. The genes were first separated into the gene set containing disease mutations (inherited and somatic) and the gene set containing neutral polymorphisms (the two sets of genes were overlapping). Each of the sets was further split into the set of genes where amino acid substitutions impact modification sites and the remaining genes. Then, the gene enrichment analysis of the two pairs of data sets was performed using the GOstat software. It is important to mention that the set of disease-associated mutations impacting modification sites was compared to the remaining set of genes containing disease mutations in order to avoid biases that correspond to the disease genes. In this way, we assume that it may be possible to identify molecular and cellular functions that were disrupted by the disease mutations or those that are regulated by post-translational modifications and related to minor phenotypic variations. The results of this analysis, using the Gene Ontology category molecular function, are shown for the phosphorylation data set (Figure 6).
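The kind of comparison described above, genes with affected modification sites against the remaining genes of the same set, can be illustrated with a hypergeometric tail probability for a single functional term. This sketch is not the GOstat implementation; the parameter names are hypothetical and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int, annotated: int, sample: int, observed: int
) -> float:
    """P(X >= observed) under the hypergeometric distribution.

    population: all genes in the background set (e.g. all disease genes)
    annotated:  background genes carrying the functional term
    sample:     genes whose modification sites are affected
    observed:   genes in the sample carrying the term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, sample)
```

A small P-value suggests that the term is over-represented among genes whose modification sites are hit by substitutions, relative to the background set they were drawn from.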
Interestingly, both sets of genes (with disease-related mutations and with neutral polymorphisms) in the vicinity of phosphorylation sites show significant enrichment in different molecular functions. For example, kinase, transferase, and signal transduction activities are significantly enriched in the disease-associated set, whereas RNA binding, transcription factor and receptor activities are significantly enriched in the neutral substitutions set. However, both sets are enriched in important molecular functions, thereby suggesting that both disease and neutral substitutions in the vicinity of phosphorylation sites may have an impact on protein function.
Next, we analyzed physicochemical properties of the amino acid substitutions affecting post-translational modifications. Polarity, charge, and hydrophobicity were chosen for this analysis. These properties were studied for the amino acid substitutions occurring directly and in the neighborhood (±3 residues) of all types of modified sites. Figure 6 shows the enrichment and depletion of the observed changes in all three data sets. We observed that the inherited disease mutations are enriched in the change of all three properties for several modification types. On the other hand, neutral mutations are depleted in such changes. Somatic mutations do not show significant signals potentially due to the small size of the data set.
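A substitution can be checked for changes in the three properties with simple residue classes. The groupings below are common textbook conventions chosen for illustration; the text does not state which classification was used, so the residue sets (including treating histidine as uncharged and cysteine as hydrophobic) are assumptions.

```python
# ASSUMPTION: illustrative residue classes, not taken from the study.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")
POLAR = set("STNQYCHKRDE")
HYDROPHOBIC = set("AVLIMFWC")


def charge(aa: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for one residue."""
    if aa in POSITIVE:
        return "positive"
    if aa in NEGATIVE:
        return "negative"
    return "neutral"


def property_changes(wild_type: str, mutant: str) -> dict:
    """Report which of the three properties differ between two residues."""
    for aa in (wild_type, mutant):
        if aa not in STANDARD:
            raise ValueError(f"not a standard amino acid: {aa!r}")
    return {
        "polarity": (wild_type in POLAR) != (mutant in POLAR),
        "charge": charge(wild_type) != charge(mutant),
        "hydrophobicity": (wild_type in HYDROPHOBIC) != (mutant in HYDROPHOBIC),
    }
```

Under these classes, an arginine-to-cysteine substitution changes charge and hydrophobicity but not polarity, whereas serine-to-threonine changes none of the three.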
As personalized medicine becomes the next frontier for scientists, industry, and the general population, it is important to develop computational approaches that can lead to a better understanding of the etiology of disease. Integration of genetic and molecular information is a sensible step in this direction because it provides a structural and functional perspective to the human variation data.
In this study, we analyzed disease-associated and putatively neutral amino acid substitution data and found that about 4.5% of inherited disease-associated amino acid substitutions (3.9% of unique sites) may affect protein function through disruption of post-translational modifications. On the other hand, about 2% of neutral polymorphisms may be affecting post-translational modifications. These numbers further indicate that disruption of post-translational modifications is not the major cause of human genetic disease. However, we have still found 238 post-translationally modified sites in human proteins whose mutation was causative of disease. In total, 1,289 modification sites were found to be in close proximity to the inherited disease mutations and represent candidates for further experimental verification.
Given our data, there are several problems that could have led to ascertainment bias. For example, our data set of post-translational modifications was heavily skewed towards phosphorylation (79%), where mass spectrometry techniques have led to a recent explosion in the number of identified sites. On the other hand, it may be argued that the modifications not identified using high-throughput methods may be more likely to be disease-relevant. It is also unclear whether the set of inherited disease data is representative, since it may be expected that genetic-association studies are more successful in identifying markers of monogenic diseases or familial forms of complex diseases. Finally, the set of neutral polymorphisms is probably contaminated with yet undiscovered disease mutations and has not been controlled for population biases.
We also analyzed the enrichment and depletion of amino acid substitutions for each post-translational modification and found that most follow similar trends when inherited disease is compared to the neutral polymorphisms. These trends held for both experimentally verified modification sites and those transferred by homology. In the case of somatic mutations, we observed some interesting cases as well. For most modification types, we have not found matches between post-translational modifications and observed somatic mutations. However, in the cases of methylation, phosphorylation and ubiquitination, there was an increased trend of disruption of post-translational modifications. Previous work has already addressed disruption of confidently predicted phosphorylation sites in cancer. Thus, the correspondence between actual sites and somatic mutations found in this study further supports this hypothesis.
While direct disruption of post-translational modifications is likely to have functional implications, the partial disruption of modified sites has a potential to lead to subtle phenotypic effects that may be more dependent on the variation present in other genes before causing organism-wide dysregulation. We believe that such changes are particularly fitting to the framework of complex disease and interaction between genetic and environmental factors.
This work was supported by the NIH grant R01 LM009722-01 to SDM, NIH grant R21CA113711 to LMI, and NSF grant DBI-0644017 to PR.
SHUYAN LI, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, U.S.A., College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, Gansu 730000, China.
LILIA M. IAKOUCHEVA, Laboratory of Statistical Genetics, The Rockefeller University, New York, NY 10065, U.S.A.
SEAN D. MOONEY, Buck Institute for Age Research, Novato, CA 94945, U.S.A.
PREDRAG RADIVOJAC, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, U.S.A.