An important challenge in translational bioinformatics is to understand how genetic variation gives rise to molecular changes at the protein level that can precipitate both monogenic and complex disease. To this end, we compiled datasets of human disease-associated amino acid substitutions (AAS) in the contexts of inherited monogenic disease, complex disease, functional polymorphisms with no known disease association, and somatic mutations in cancer, and compared them with respect to predicted functional sites in proteins. When the sequence homology-based tool SIFT was used to estimate the proportion of deleterious AAS in each dataset, only complex disease AAS were found to be indistinguishable from neutral polymorphic AAS. Monogenic disease AAS predicted by SIFT to be non-deleterious were characterized by a significant enrichment within solvent-accessible residues and regions of intrinsic protein disorder, and by an association with the loss or gain of various post-translational modifications. Sites of structural and/or functional interest were therefore surmised to constitute useful additional features with which to identify the molecular disruptions caused by deleterious AAS. A range of bioinformatic tools, designed to predict structural and functional sites in protein sequences, was then employed to demonstrate that intrinsic biases exist in terms of the distribution of different types of human AAS with respect to specific structural, functional and pathological features. Our web tool, designed to potentiate the functional profiling of novel AAS, has been made available at http://profile.mutdb.org/.
Understanding the molecular consequences of the mutations that cause human genetic disease remains an important research challenge [Karchin, 2009; Mooney, 2005; Ng & Henikoff, 2006; Steward et al., 2003]. There are now several resources available which employ annotations describing biochemical features that are potentially useful for identifying function-altering and/or disease-associated amino acid substitutions (AAS), including SNPs3D [Yue et al., 2006], the SNP Function Portal [Wang et al., 2006], PolyDoms [Jegga et al., 2007], LS-SNP [Karchin et al., 2005a] and MutDB [Singh et al., 2007] among others. However, these resources typically use only sequence and structural features, such as evolutionary conservation in the vicinity of the site of mutation, and make no attempt to quantify the relative contributions made by specific molecular functions (features) that have either been introduced or disrupted by the mutations in question. Additionally, various tools have been developed to predict dysfunctional and/or disease-causing AAS. These include SIFT [Ng & Henikoff, 2003], PolyPhen [Ramensky et al., 2002], PMUT [Ferrer-Costa et al., 2005], PANTHER [Mi et al., 2007], LS-SNP [Karchin et al., 2005a], RCOL profiles [Terp et al., 2002], SNAP [Bromberg & Rost, 2007] and the SVM at SNPs3D [Yue et al., 2006] among others. All these tools operate using approximately the same principles, i.e. they are all supervised and employ features based on protein sequence, sequence conservation and/or protein structure. For example, approaches to classification of mutation sites have used linear regression [Chasman & Adams, 2001], neural networks [Bromberg & Rost, 2007], support vector machines [Krishnan & Westhead, 2003] and decision trees [Karchin et al., 2005a; Saunders & Baker, 2002]. 
These tools differ, however, in terms of their choice of training data, which can be datasets of human disease alleles as in the case of PolyPhen [Ramensky et al., 2002], evolutionary mutations that differentiate closely related species [Arbiza et al., 2006; Capriotti et al., 2008b], or experimentally induced mutations such as those originally studied with SIFT [Ng & Henikoff, 2003]. Other groups have used additional novel features such as physicochemical properties [Jiang et al., 2006], structural information [Tavtigian et al., 2008], information theory [Karchin et al., 2005b] and Gene Ontology terms [Calabrese et al., 2009]. These features have served to improve predictive accuracy. Despite differences in dataset construction and statistical inference models, the tools listed above yield remarkably similar predictions, an unsurprising finding since they were designed with a similar goal in mind, viz. to predict functional vs. non-functional mutations or disease vs. non-disease mutations.
However, since the features to be examined are nearly all exclusively based on protein sequence and structure, the currently available tools are inherently incapable of shedding light on the molecular causes of disease beyond simple disruptions of protein structure or sequence conservation. We have therefore set out to extend this area of enquiry by attempting to quantify the relative contributions made by different protein features when disrupted by mutation. To this end, we have evaluated the presence (and mutation-induced disruption) of a range of structural and functional features predicted by several different bioinformatics tools. Our approach was conceptually straightforward in that we utilized statistical inference methods to predict amino acid functions and then estimated how these predictions were altered by AAS.
Machine learning methods that predict structural and functional sites in amino acid sequences are well established and facilitate the prediction of secondary structure [Faraggi et al., 2009], solvent accessibility [Faraggi et al., 2009], post-translational modification [Iakoucheva et al., 2004] and enzyme catalysis [Youn et al., 2007]. These tools typically employ both sequence- and structure-based features, and have been trained on datasets of well characterized functional sites. For example, residues involved in enzyme catalysis can be predicted using machine learning methods trained on a database of catalytic sites [Porter et al., 2004]. Here, we have assessed the relative contributions of a range of functional site features to protein disruption in several disease-associated mutation datasets as well as a dataset of mutations which (in all likelihood) lack functional significance. We used several methods to predict structural features, post-translational modification and catalytic residues in the analysis of five test datasets containing different types of human amino acid substitutions: (i) mutations causing inherited disease, (ii) somatic cancer-associated mutations identified in breast and colorectal tumours, (iii) somatic cancer-associated amino acid substitutions identified in protein kinase genes from diverse human tumours, (iv) functional polymorphisms (with no known disease association), (v) putatively functional polymorphisms associated with human inherited disease, and a control dataset of putatively neutral polymorphisms. The results of our study indicate a significant difference between disease and non-disease associated variants in terms of both the structural and functional features disrupted.
Five distinct sets of amino acid substitutions (AAS) with different disease annotations were collected for the purposes of this analysis. Firstly, heritable AAS from the Human Gene Mutation Database (HGMD, August 2007; http://www.hgmd.org; Stenson et al., 2009) were grouped into three different categories:
Secondly, two additional datasets of somatic cancer-associated AAS were obtained from recent cancer resequencing studies [Sjöblom et al., 2006; Greenman et al., 2007]. The first of these datasets comprised those AAS identified in exons derived from 20,857 transcripts from 11 breast and colorectal tumours. This breast and colorectal cancer dataset (henceforth referred to as the ‘Cancer’ dataset) represents 1099 somatic substitutions from 847 different human genes [Sjöblom et al., 2006]. The second dataset of cancer-associated AAS comprises 695 somatic substitutions identified in the exons of 518 protein kinase genes from 312 diverse human tumours, henceforth referred to as the ‘Kinase’ dataset [Greenman et al., 2007].
Finally, a set of AAS (annotated as ‘polymorphism’), downloaded from the UniProtKB/Swiss-Prot database [Boeckmann et al., 2003; ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/docs/humsavar.txt], comprised the neutral AAS used in this study. This dataset represents one of the most extensive sources of putatively neutral polymorphism data available but is nevertheless unlikely to represent a truly neutral dataset since at least some of the component AAS could yet prove to have an association with disease [Care et al., 2007]. To further improve the neutral credentials of this dataset, any AAS that were concurrently annotated in HGMD as being disease-causing or of potential functional significance were removed (N.B. 1,589 AAS were excluded in this way). In addition, since rare missense alleles are inherently more likely to be deleterious than common missense alleles [Kryukov et al., 2007], only those AAS in the UniProtKB/Swiss-Prot dataset that occurred at polymorphic frequencies (≥1% in a population of European descent; Hap-Map-CEU) were retained. This putatively neutral set of AAS therefore contained 8,509 human polymorphisms (taken from a total of 4,864 different genes) and shall henceforth be known as the ‘Swiss-Prot neutral’ dataset. Once again, it should be noted that we cannot wholly exclude the possibility that a subset of these supposedly neutral polymorphisms could be of functional importance or that they might have a role either in complex disease or as modifiers of disease susceptibility. Table 1 summarizes the above mutation datasets.
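The two filtering steps described above (removal of AAS concurrently annotated in HGMD, then retention of AAS at polymorphic frequency) can be illustrated with a minimal Python sketch. The record layout, the field name `ceu_frequency`, the variant identifiers and the function name are all hypothetical and chosen for illustration only; they do not reflect the actual file formats of UniProtKB/Swiss-Prot, HGMD or HapMap.

```python
def filter_putatively_neutral(variants, hgmd_variant_ids, min_frequency=0.01):
    """Keep polymorphisms that are absent from the disease list and that
    reach the polymorphic frequency threshold (>= 1% by default)."""
    kept = []
    for variant in variants:
        if variant["id"] in hgmd_variant_ids:
            continue  # concurrently annotated as disease-causing or functional
        frequency = variant.get("ceu_frequency")
        if frequency is None or frequency < min_frequency:
            continue  # no frequency data, or rarer than the threshold
        kept.append(variant)
    return kept


# Hypothetical toy records (not real variants).
toy_variants = [
    {"id": "GENE1:p.A10T", "ceu_frequency": 0.12},
    {"id": "GENE2:p.R55H", "ceu_frequency": 0.004},
    {"id": "GENE3:p.G7S", "ceu_frequency": 0.30},
    {"id": "GENE4:p.L2P", "ceu_frequency": None},
]
toy_hgmd_ids = {"GENE3:p.G7S"}

neutral = filter_putatively_neutral(toy_variants, toy_hgmd_ids)
print([v["id"] for v in neutral])
```

In this toy input, only the first record survives both filters: the second is too rare, the third is listed in the disease set, and the fourth has no frequency data.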
Using gene ontology (GO) terms (http://www.geneontology.org), lists of inherited disease genes matching the GO terms for oncogene (GO:0008151) or ‘tumour suppressor’ and ‘anti-oncogene’ (GO:0045786) were compiled. The first subset of AAS in tumour suppressor genes comprised 1,227 AAS from 33 genes. The second subset of AAS in oncogenes contained 288 AAS from 26 genes.
The tools employed in this analysis were sequence-based and included measures of structure, function and post-translational modification. The tools chosen were of sufficiently high accuracy to be useful in testing biological hypotheses. Secondary structure (80% accuracy) and solvent accessibility (79% accuracy) were predicted using SPINE [Dor & Zhou, 2005]. Protein structure stability was assessed using Imutant [Capriotti et al., 2008a] (77% accuracy). Regions of intrinsic protein disorder were predicted using the VSL2B predictor [Peng et al., 2006] (>85% accuracy). Short structured or loosely structured helical regions within long disordered regions (so-called Molecular Recognition Fragments, MoRFs) were identified using a predictor of calmodulin-binding targets, CaMBTP [Radivojac et al., 2006] (81% accuracy). Post-translational modification sites were identified using DisPhos to identify phosphorylation sites [Iakoucheva et al., 2004] (75% accuracy claimed for serine, threonine and tyrosine residues), OGlycoPred to identify O-linked glycosylation sites (77% accuracy claimed for serine, threonine, proline and lysine; Radivojac, unpublished work), UbPred to predict sites of ubiquitination [Radivojac et al., 2009] (72% accuracy claimed) and MethylPred to predict sites of protein methylation [Daily et al., 2005] (71% accuracy claimed for arginine and lysine residues). Catalytic sites were ascertained with a catalytic residue predictor termed CRP [Youn et al., 2007] (65% accuracy claimed over all residues). Finally, SIFT [Ng & Henikoff, 2003] was used to predict whether or not the AAS were deleterious. The tools described above were employed in order to interrogate both the wild-type and the mutant sequences; any change in prediction scores between the wild-type and mutant sequences was recorded. Conservative cutoffs were employed to minimize the false discovery rates. For the tools that generated predictions as probabilities (i.e. disorder, calmodulin-binding sites, phosphorylation, O-linked glycosylation, ubiquitination, methylation and catalytic residues), only ‘high confidence sites’ were considered, defined here as sites with a false positive prediction rate of ~0.1 (estimated during model evaluation).
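The wild-type versus mutant comparison described above can be sketched in Python as follows. This is an illustrative outline only: `toy_predictor` is a hypothetical stand-in that scores serine, threonine and tyrosine highly, and is not DisPhos or any of the tools named above; the score cutoff of 0.5 is likewise an assumption, since the score corresponding to a false positive rate of ~0.1 differs for each tool.

```python
def apply_substitution(sequence, position, wild_type, mutant):
    """Return the mutant sequence; position is 1-based."""
    if sequence[position - 1] != wild_type:
        raise ValueError("wild-type residue does not match the sequence")
    return sequence[:position - 1] + mutant + sequence[position:]


def classify_site_change(wt_score, mut_score, cutoff):
    """Label a site as lost, gained or unchanged relative to a score cutoff."""
    wt_site = wt_score >= cutoff
    mut_site = mut_score >= cutoff
    if wt_site and not mut_site:
        return "loss"
    if mut_site and not wt_site:
        return "gain"
    return "unchanged"


def profile_substitution(sequence, position, wild_type, mutant, predictor, cutoff):
    """Run one predictor on the wild-type and mutant sequences and record the change."""
    mutant_sequence = apply_substitution(sequence, position, wild_type, mutant)
    wt_score = predictor(sequence, position)
    mut_score = predictor(mutant_sequence, position)
    return {
        "wt_score": wt_score,
        "mut_score": mut_score,
        "delta": mut_score - wt_score,
        "change": classify_site_change(wt_score, mut_score, cutoff),
    }


def toy_predictor(sequence, position):
    """Hypothetical stand-in predictor: high score for S/T/Y, low otherwise."""
    return 0.9 if sequence[position - 1] in "STY" else 0.1


# Toy example: replacing a serine removes the (toy) predicted site.
result = profile_substitution("MKSAY", 3, "S", "A", toy_predictor, cutoff=0.5)
print(result["change"])
```

Passing the predictor as a callable keeps the comparison logic independent of any particular tool, so the same routine could in principle be applied to each probability-based feature in turn.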
The large size of the combined datasets (41,442 AAS in total) from this study makes it impractical for each individual AAS to be functionally investigated in vitro. However, it should be possible, at least in principle, to validate a subset of our in silico predictions against a series of amino acid residues of known structural/functional importance. To this end, a test sample of 6,073 AAS (from 1,209 distinct proteins) was selected from the total (41,442) AAS under study. These AAS represented all AAS from the proteins for which functional data on stability, secondary structure, solvent accessibility, disordered regions, calmodulin-binding sites, catalytic site residues and post-translational modification (methylation, phosphorylation, O-linked glycosylation and ubiquitination) could be obtained. This test sample of AAS was then assessed in order to establish whether the original structural/functional predictions were true positives (TP), false positives (FP), true negatives (TN) or false negatives (FN).
The in vitro data on the 6,073 AAS test sample, required for validating our original in silico predictions, were obtained from publicly available databases augmented by searches of the scientific literature. In vitro data on the consequences of AAS (in 16 human proteins) for protein stability were obtained from Allali-Hassani et al. (2009). The program DSSP (Kabsch et al. 1983) was used to extract secondary structure and solvent accessibility information from 12 human proteins with known X-ray crystallographic structures. The locations of disordered regions within 61 human proteins were obtained from DisProt v. 4.9 (Sickmeier et al. 2007). The locations of known calmodulin-binding sites in 10 human proteins were obtained from the Calmodulin Target Database (Yap et al. 2000). The locations of catalytic site residues in 65 human proteins were obtained from the Catalytic Site Atlas (v. 2.2.11; Porter et al., 2004). Finally, the UniProt Knowledgebase (Release 15.7) and Human Protein Reference Database (HPRD; Keshava Prasad et al. 2009) together yielded data on post-translational modifications for 1,140 human proteins.
For the Inherited disease AAS, the disease terms recorded in the original publications were mapped to the Unified Medical Language System (UMLS) using a simple word permutation-based method developed and tested by Shah et al. [Shah et al., 2006; Shah et al., 2007]. The disease names were mapped to UMLS concept identifiers (CUI) using the open source UMLS-Query module [Shah & Musen, 2008]. UMLS-Query provides a function called maptoId, which accepts a phrase (up to 10 words) and maps it to a CUI (and can be restricted by a vocabulary if so desired). The function first looks for an exact match for the phrase; if none is found, it will generate all possible permutations and attempt an exact match for each one. The function also performs right truncation to look for partial matches. For example, calling the function to find a CUI belonging to the SNOMED-CT for ‘intraductal carcinoma of prostate’ will match concepts “intraductal” (C0007124) as well as “carcinoma of prostate” (C0600139). Permutation generation along with right truncation is conceptually similar to using skip n-grams for matching concepts. Skip bigrams have been shown to perform at or above state-of-the-art measures with less complexity, for the purpose of identifying matching concepts [Reeve & Han, 2007]. Some 23,594 (~80% of the total) disease terms relating to the Inherited disease AAS were mapped to the UMLS with high confidence. The hierarchy of disease terms from the SNOMED-CT ontology was used to explore the relationships between the disease states and the underlying molecular phenotypes.
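The matching strategy described above (exact match, then permutations, then right truncation) can be illustrated with a short Python sketch. This is not the UMLS-Query module or its API: the function name `map_to_ids`, the dictionary-based lexicon and the order in which candidates are tried are assumptions made for illustration. The two lexicon entries are simply the concept strings and CUIs quoted in the example above.

```python
from itertools import permutations

# Toy lexicon holding only the two concepts quoted in the text's example.
TOY_LEXICON = {
    "intraductal": "C0007124",
    "carcinoma of prostate": "C0600139",
}


def map_to_ids(phrase, lexicon, max_words=10):
    """Map a phrase to concept identifiers.

    Tries, in order: an exact match, an exact match on any word
    permutation, and finally right-truncated permutations (partial matches).
    """
    words = phrase.lower().split()
    if len(words) > max_words:
        raise ValueError("phrase is longer than max_words")

    exact = " ".join(words)
    if exact in lexicon:
        return {exact: lexicon[exact]}

    # Exact match on every permutation of the words.
    for ordering in permutations(words):
        candidate = " ".join(ordering)
        if candidate in lexicon:
            return {candidate: lexicon[candidate]}

    # Right truncation: drop words from the end of each permutation.
    matches = {}
    for ordering in permutations(words):
        for end in range(len(ordering) - 1, 0, -1):
            candidate = " ".join(ordering[:end])
            if candidate in lexicon:
                matches[candidate] = lexicon[candidate]
    return matches


partial = map_to_ids("intraductal carcinoma of prostate", TOY_LEXICON)
print(partial)
```

Because the number of permutations grows factorially with phrase length, a sketch of this kind is practical only for short phrases; the 10-word limit mentioned above bounds that cost.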
Using Swiss-Prot neutral as a control dataset, we compared the distribution of the structural and functional sites for each dataset (Inherited disease, Disease-associated polymorphism, Functional polymorphism, Kinase and Cancer) against the Swiss-Prot neutral distribution. To allow for multiple testing, the significance of any difference noted was then assessed by means of Fisher's exact test with Bonferroni correction. Only P values < 0.00172 (0.05/29) were considered significant.
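The testing procedure described above can be sketched in Python. The following is a minimal standard-library implementation of a two-sided Fisher's exact test, combined with the Bonferroni-corrected threshold of 0.05/29; it is offered as an illustration rather than as the code used in this study, and the counts in the example 2×2 table are hypothetical.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_hypergeom_pmf(x, row1, row2, col1):
    """log P(top-left cell = x) for a 2x2 table with fixed margins."""
    return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(row1 + row2, col1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_hypergeom_pmf(a, row1, row2, col1)
    total = 0.0
    # Sum the probabilities of all tables no more likely than the observed one.
    for x in range(lowest, highest + 1):
        log_p = log_hypergeom_pmf(x, row1, row2, col1)
        if log_p <= log_observed + 1e-7:
            total += math.exp(log_p)
    return min(1.0, total)


N_TESTS = 29                   # number of comparisons, as stated in the text
ALPHA = 0.05
threshold = ALPHA / N_TESTS    # Bonferroni-corrected significance level

# Hypothetical counts: AAS at a given site type vs. elsewhere,
# for a test dataset (first row) and the neutral control (second row).
p_value = fisher_exact_two_sided(70, 30, 41, 59)
print(p_value < threshold)
```

Working in log space keeps the hypergeometric probabilities numerically stable for datasets containing tens of thousands of AAS.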
Identifying the biological functions disrupted by specific amino acid substitutions (AAS) is an important challenge that has relevance both for understanding the underlying molecular mechanism(s) of a given disease and for identifying functional polymorphic sites that may impact upon both complex disease and disease susceptibility. The enrichment of AAS at residues of structural or functional importance in each dataset was compared and contrasted, as depicted in Figures 1 and 2.
SIFT was used to predict the proportion of deleterious AAS in each missense mutation dataset. The Inherited disease, Functional polymorphism, Cancer and Kinase datasets were all characterized by a significant enrichment in the proportion of substitutions predicted to be deleterious when compared to the putatively neutral Swiss-Prot neutral dataset (see Figure 1; SIFT). For the Inherited disease dataset, ~76% of AAS were predicted to be deleterious (average SIFT score 0.072), a value very similar to the proportion (69%) previously predicted to be deleterious using disease-causing AAS from UniProtKB/Swiss-Prot [Ng & Henikoff, 2003]. For the Functional polymorphisms, 59% of AAS were predicted to be deleterious using SIFT (average SIFT score 0.162).
By contrast, SIFT predicted that only ~25% of Disease-associated polymorphisms were deleterious (average SIFT score 0.38), a proportion almost identical to the 22% noted for the Swiss-Prot neutral control dataset. Since the ±20% accuracy range of the SIFT method [Ng & Henikoff, 2003] renders reliable discrimination of these datasets impossible, we must conclude that there is no evidence for a significant difference between the two datasets. There are two plausible explanations to account for the marked similarity between the Swiss-Prot neutral and Disease-associated polymorphism datasets in terms of their SIFT scores. Firstly, the contribution of disease-associated polymorphisms to disease may well be additive via the net effect of multiple subtle modifications to function [Schork et al., 2009]. In agreement with this assertion, we found that the Disease-associated polymorphisms were located mainly in exposed residues (55.2%) or within disordered regions (19.2%). Such residues tend to be less highly conserved evolutionarily than those which are buried within the protein structure. Hence, polymorphic variants in these locations may exert a subtle influence on protein function rather than a drastic one. Since SIFT employs evolutionary conservation as a proxy to predict function, it may be beneficial to retrain the method with these AAS when using SIFT to make predictions regarding polymorphic AAS located within exposed residues or disordered regions. For example, disordered protein regions have been shown to exhibit different rates of evolution (Brown et al., 2002) and different amino acid substitution patterns (Radivojac et al., 2002) than ordered regions. The alternative possibility is that a large proportion of disease-associated polymorphisms (considered by the original authors reporting them to be directly causative of the disease association) are not in reality the variants directly responsible for the disease association. Instead, they may simply be closely linked to (and/or in strong linkage disequilibrium with) those additional, and hitherto undetected, functional variants actually responsible for the observed disease associations.
Under the assumption that all the AAS in the Inherited disease dataset do indeed represent causative variants underlying the various genetic diseases as claimed by the original reporting authors, it can be seen that only 76% of them are predicted by SIFT to disrupt protein function. If we break down the SIFT predictions on a gene-wise basis for inherited disease, we see that SIFT prediction accuracy (i.e. the proportion of inherited disease-causing AAS predicted to disrupt protein function) ranged from 31-100% (Supp. Table S1 & S2). Analyzing a subset of 6,457 Inherited disease AAS which SIFT had predicted not to be of functional significance (i.e. ‘tolerant’, denoting tolerated), revealed that ~50% (3,210 AAS) were located in surface exposed regions, representing a significant enrichment over the Inherited disease dataset as a whole (+20%; P = 4.15·10−199, Fisher's exact test). The predicted ‘tolerant’ Inherited disease subset was also significantly depleted, as compared to the entire Inherited Disease dataset, with respect to AAS giving rise to a decrease in protein stability of ≥1 kcal/mol (−5.4%; P = 2.14·10−7, Fisher's exact test) and enriched for AAS located in disordered regions (+1.5%; P = 3.2·10−4, Fisher's exact test). The ‘tolerant’ Inherited disease subset also exhibited a significant enrichment for AAS predicted to result in the loss of phosphorylation sites (+0.4%; P = 2.55·10−5, Fisher's exact test) and AAS giving rise to a gain of ubiquitination sites (+0.3%; P = 4.43·10−4, Fisher's exact test).
The ‘tolerant’ Inherited disease subset exhibited similarities to both the Disease-associated polymorphism and Cancer datasets e.g. in terms of the distribution of mutations in both surface-exposed residues (>50%) and within disordered regions (~20%). It may nevertheless be important, when evaluating AAS in exposed or disordered regions, to attribute a lower confidence level to the ‘tolerant’ label assigned by SIFT; this may hold true especially when evaluating polymorphic AAS which by their very nature tend to be located in evolutionarily less highly conserved regions.
The 6,073 AAS of the test sample represented all AAS from the proteins for which functional data on stability, secondary structure, solvent accessibility, disordered regions, calmodulin-binding sites, catalytic site residues and post-translational modification (methylation, phosphorylation, O-linked glycosylation and ubiquitination) could be obtained. Nine examples of these AAS, representing nine of the features being considered here, and known to be associated with a human inherited disease, are listed in Supp. Table S3.
The standard benchmarking statistics used to evaluate the structural/functional predictions made on the test sample of 6,073 AAS were the false positive rate (FPR), sensitivity, specificity, the Matthews Correlation Coefficient (MCC; Matthews, 1975) and the accuracy (the mean of the sensitivity and specificity scores). The MCC was employed since it represents one of the best available measures of prediction quality. It returns a value between −1 and +1; a coefficient of −1 represents the worst possible prediction, 0 a random prediction and +1 a perfect prediction. The validation of the original in silico predictions for the test sample of 6,073 AAS is summarized in Supp. Table S4. MCC values for the various predictors were in the range of 0.125-0.701. Our combined algorithm would therefore appear to have performed best on the following features: secondary structure, solvent accessibility, calmodulin-binding, O-linked glycosylation and ubiquitination (MCC>0.50). The in silico predictions for phosphorylation sites had by far the lowest MCC score (0.125) but it should be appreciated that even in quite well characterized proteins, it is highly likely that not all bona fide phosphorylation sites will have been identified experimentally. In summary, using our combined algorithm on our test sample of 6,073 AAS, we were able to achieve a sensitivity of 0.63 and a specificity of 0.93 with respect to identifying sites of known structural/functional importance. Despite the relatively small size of the test sample, this validation serves to confirm that our original in silico predictions are, at the very least, of sufficient quality for the generation and testing of biological hypotheses. Furthermore, since virtually all the prediction models we employed were trained on data derived from a number of different species, the chance of ‘overfitting’ to human data is minimized.
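For reference, the benchmarking statistics named above can be computed from the four confusion-matrix counts as in the following Python sketch. Accuracy is taken here as the mean of sensitivity and specificity, as defined in the text. The counts in the example are hypothetical, chosen only so that they reproduce a sensitivity of 0.63 and a specificity of 0.93; they are not the counts from Supp. Table S4.

```python
import math


def benchmark(tp, fp, tn, fn):
    """Return FPR, sensitivity, specificity, accuracy and MCC for one predictor."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (fp + tn)
    # Accuracy as defined in the text: mean of sensitivity and specificity.
    accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "fpr": false_positive_rate,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (100 known sites, 100 known non-sites).
stats = benchmark(tp=63, fp=7, tn=93, fn=37)
print(round(stats["sensitivity"], 2), round(stats["specificity"], 2))
```

Unlike sensitivity or specificity taken alone, the MCC uses all four counts, which is why it remains informative when sites and non-sites are very unequally represented.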
The structural properties of the sites altered by amino acid substitution in the different datasets are also summarized in Figure 1. Buried residues located within the core of a protein have long been known to be important for protein folding and stability [Sandberg et al., 1995] whereas residues located at or close to the surface of a protein are more likely to be involved in protein-protein interactions [Ye et al., 2006]. When solvent accessibility was considered, a significant enrichment of AAS at buried sites was noted for both the Inherited disease and Functional polymorphism datasets: 70% (P < 2.2·10−16; Fisher's exact test) and 58% (P < 2.2·10−16; Fisher's exact test) respectively as compared to 41% of Swiss-Prot neutral (Figure 1, solvent accessibility, ‘buried’). By contrast, the Kinase (solvent accessibility, 43% AAS buried), Cancer (solvent accessibility, 43% AAS buried) and Disease-associated polymorphism datasets (solvent accessibility, 44% AAS buried) were indistinguishable from the Swiss-Prot neutral dataset in terms of their solvent accessibility, indicating that the pathogenic effects of their AAS may not be biased towards the disruption of intrinsic structural properties in the same way that Inherited Disease and Functional polymorphisms are.
Since our prediction algorithms are sequence-based, we can readily evaluate (without structural modeling) changes in prediction for specific AAS by running the bioinformatics tools on both the wild-type and mutant sequences. It should be noted that large predicted structural changes are based primarily upon local sequence features and do not take the entire protein sequence into account. However, although these features may not indicate actual mutation-induced conformational changes (e.g. the conversion of an alpha-helix into a beta-sheet), by performing this experiment we are testing whether these features represent good indicators of a disruptive mutation. To test whether these findings were significant by comparison with the putatively neutral AAS, we calculated the enrichment (or depletion) with respect to the Swiss-Prot neutral dataset, determining significance using Fisher's exact test with a Bonferroni correction. When we examined the change in protein stability consequent to a given amino acid substitution, we observed a significant enrichment for AAS that give rise to a decrease in stability of ≥1kcal/mol but only for the Inherited disease dataset (+9.7%; P = 3.9·10−24; Fisher's exact test). In terms of the change of predicted solvent accessibility due to amino acid substitution (Exposed>Buried & Buried>Exposed, Figure 3), only the Cancer and Kinase datasets exhibited a significant enrichment for AAS predicted to be located at surface exposed residues in the wild-type protein but buried as a consequence of the amino acid substitution (Exposed>Buried; Cancer = +2.9%; P = 1.3·10−3; Kinase = +3.9%; P = 6.2·10−4; Fisher's exact test; Figure 3). Such Exposed>Buried transitions are likely to exert a dramatic effect upon protein function.
When secondary structure was explored, the Inherited disease and Functional polymorphism datasets were both found to be significantly enriched in AAS within alpha-helical regions (Inherited disease = +4.3%; P = 8.9·10−14; Functional polymorphism = +19.2%; P = 1.7·10−29; Fisher's exact test) but significantly reduced in AAS located in coiled regions; as compared to the Swiss-Prot neutral dataset (Inherited disease = −11.2%; P = 6.4·10−73; Functional polymorphism = −17.3%; P = 9.8·10−23; Fisher's exact test; Figure 1). The increased number of AAS in the alpha-helical regions for both the Inherited disease and Functional polymorphism datasets may be attributed to the fact that helices constitute one of the most common recognition motifs in proteins [Che et al., 2007]. It follows that modifying these regions may alter the biological activities of the protein involved. An example of a protein with an enrichment of disease-causing AAS within alpha-helical regions is keratin 12 (KRT12) in which AAS often only occur in the highly conserved alpha-helical regions essential for keratin filament assembly (alpha-helix-initiation motif of rod domain 1A or alpha-helix-termination motif of rod domain 2B) [Nishida et al., 1997]. The depletion of AAS within coiled regions for Inherited disease mutations and Functional polymorphisms may be related to the lack of a specific three dimensional structure (barring a few exceptions) in coiled regions.
The Inherited disease dataset was also significantly enriched for AAS in beta-sheet regions (+6.8%; P = 1.62·10−49, Fisher's exact test) and for changes of predicted secondary structure due to amino acid substitution from a beta-sheet region to a coiled or alpha-helical region (Sheet>Helix,Coil; +0.9%; P = 3.4·10−7; Fisher's exact test; Figure 3). The conversion of a beta-sheet into an alpha-helical region may lead to new and deleterious interactions in the disease state since helical regions are the most common recognition motifs of proteins [Che et al., 2007]. The strong bias of Inherited disease mutations and (especially) Functional polymorphisms towards alpha-helical regions suggests that secondary structure may represent a particularly informative feature for machine learning and the computational classification of deleterious AAS.
Intrinsically disordered (ID) protein regions lack a unique 3-D structure and exist in a dynamic ensemble of different conformations [Dunker et al., 2001]. Their functional roles are well documented and they tend to be enriched in regulation and signaling via protein–protein and protein-nucleic acid interactions [Dyson & Wright, 2005; Radivojac, 2007]. The number of AAS from the Inherited disease (3034 AAS; −17.83%; P = 2.2·10−16; Fisher's exact test), Disease-associated polymorphism (146 AAS; −8.89%; P = 6.2·10−8; Fisher's exact test), Functional polymorphism (58 AAS; −21.63%; P = 8.83·10−56; Fisher's exact test) and Kinase (120 AAS; −10.86%; P = 1.44·10−10; Fisher's exact test) datasets occurring within ID regions was significantly reduced by comparison with the Swiss-Prot neutral dataset (Figure 1). Although ID regions were significantly depleted in the vicinity of Disease-associated polymorphisms, AAS within ID regions still account for ~19% of this dataset, as compared to 10% of the Inherited Disease dataset.
Disease-associated polymorphisms in ID regions may play an additive role in complex disease [e.g. p.G460W (ADD1, MIM# 102680) which is associated with hypertension; Cusi et al., 1997] or may in some cases act as disease modifiers for disease-causing mutations [as in the case of p.H558R (SCN5A, MIM# 600163) which modifies the effects of the disease-causing p.T512I on Na+ channel function [Viswanathan et al., 2003] or p.A115S in xylosyltransferase I (XYLT1, MIM# 608124) which is associated with higher serum XT activity and acts as a disease modifier in pseudoxanthoma elasticum (PXE, MIM# 264800); Schön et al., 2006].
It was also noted that 26% of those entries in the Disease-associated polymorphism dataset which were located within ID regions, were associated with cancer susceptibility. This supports previous work which has highlighted the importance of intrinsic disorder in cell signaling and cancer-associated proteins [Iakoucheva et al., 2002]. One example is the missense polymorphism p.A538T (HIF1A, MIM# 603348) which is located within an ID region and is associated with renal carcinoma [Ollerenshaw et al., 2004].
The distributions of AAS predicted at functional amino acid residues are summarized in Figure 2 for the different datasets. The Inherited disease dataset was characterized by a significant enrichment of AAS located at catalytic residues (+2.32%; P = 7.54·10−10; Fisher's exact test) but displayed a significant paucity of AAS at calmodulin-binding sites (−0.62%; P = 1.12·10−3; Fisher's exact test) and at three different types of post-translational modification site, namely O-linked glycosylation (−0.64%; P = 7.29·10−17; Fisher's exact test), ubiquitination (−0.45%; P = 1.12·10−12; Fisher's exact test) and phosphorylation (−2.01%; P = 1.47·10−54; Fisher's exact test) (Figure 2). The Functional polymorphism dataset also displayed a paucity of AAS at phosphorylation sites (−2.1%; P = 2.17·10−6; Fisher's exact test). The Cancer dataset was significantly enriched for AAS at calmodulin-binding sites (+1.82%; P = 1.62·10−3; Fisher's exact test; Figure 2, Calmodulin-binding sites). When we examined the change in functional site (gain or loss) consequent to a given amino acid substitution, the Functional polymorphism dataset was significantly enriched for gains of catalytic residues (+2.63%; P = 3.76·10−4; Fisher's exact test; Figure 4). The Inherited disease and Functional polymorphism datasets both displayed a paucity of AAS giving rise to losses or gains of phosphorylation, whilst the Inherited disease dataset also exhibited a paucity of AAS resulting in the loss of O-linked glycosylation sites and the gain or loss of ubiquitination sites.
In terms of functional sites, the Inherited disease dataset was found to be enriched in AAS at catalytic residues but depleted in AAS at the three types of post-translational modification site tested. The Functional polymorphism dataset was enriched in AAS giving rise to the gain of catalytic sites, whereas the Cancer dataset was enriched for AAS at calmodulin-binding sites.
Two subsets of missense mutations were derived from the Inherited disease dataset viz. germline AAS from tumour suppressor genes (1,227 AAS; 33 genes) and germline AAS from oncogenes (288 AAS; 26 genes). Disease-causing missense substitutions in oncogenes are usually dominant gain-of-function mutations whereas their counterparts in tumour suppressor genes tend to be recessive loss-of-function mutations. Overall, disease-causing AAS in oncogenes and tumour suppressor genes exhibited significant differences in terms of both their SIFT-predicted deleteriousness and the distribution of AAS within regions of intrinsic protein disorder (Table 2). Some 69.2% of tumour suppressor AAS and 82.3% of oncogene AAS were predicted by SIFT to be deleterious (‘intolerant’) (P = 8.89·10−3; Fisher's exact test). With respect to the distribution of AAS within protein regions of intrinsic disorder, these two subsets exhibited significant differences, with 15.1% of tumour suppressor AAS and 4.3% of oncogene AAS located within disordered regions (P = 1.8·10−7; Fisher's exact test).
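The dataset comparisons reported above rest on Fisher's exact test applied to 2×2 contingency tables (for example, AAS inside versus outside disordered regions in two datasets). The following is a minimal sketch of a two-sided test, assuming the common definition in which all tables no more probable than the observed one are summed. The function name and the counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two datasets; columns are 'AAS with feature' and
    'AAS without feature'. Returns the P value.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts (not from the study): 40 of 200 AAS in one dataset
# and 10 of 200 AAS in another fall within disordered regions.
p = fisher_exact_two_sided(40, 160, 10, 190)
print(f"P = {p:.3g}")
```

For routine use, an established implementation such as `scipy.stats.fisher_exact` offers a comparable test; the pure-Python version is shown only to make the calculation explicit.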
Inspection of a ‘heat map’ depicting the enrichment or depletion of all AAS by disease category (Figure 5) reveals that only a few of the inherited disease AAS classes exhibit statistically significant differences in terms of the underlying molecular function disrupted. Blood coagulation disorders were found to exhibit a significant depletion in terms of post-translational modification sites, including a 19-fold depletion in AAS at O-linked glycosylation sites (P = 1.2·10−6; Fisher's exact test; Figure 5) and a 13-fold depletion at ubiquitination sites (P = 1.7·10−4; Fisher's exact test; Figure 5). Genitourinary disorders exhibited a 6-fold depletion in AAS at phosphorylation sites (P = 2.4·10−10; Fisher's exact test; Figure 5). Nutritional diseases exhibited a 13-fold depletion for AAS located both within disordered regions and at phosphorylation sites (P = 6.7·10−80 and P = 1.6·10−8 respectively; Fisher's exact test; Figure 5). Developmental and psychiatric disorders both showed a 2-fold enrichment for AAS located in calmodulin-binding sites (P = 3.6·10−4 and P = 2.9·10−4 respectively; Fisher's exact test; Figure 5). Overall, however, the predicted enrichment or depletion of specific protein features was not found to be an inherited disease-specific phenomenon.
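The fold changes quoted above can be read as ratios of proportions. The sketch below shows one way such a value might be computed, assuming that fold change is the fraction of AAS at a feature in a disease class divided by the corresponding fraction in a reference set. This definition, the function names and the example counts are assumptions made for illustration; the text does not specify how the heat map values were derived.

```python
from math import log2


def fold_change(hits_a, total_a, hits_b, total_b):
    """Fraction of AAS at a feature in set A divided by that in set B."""
    if hits_b == 0:
        raise ValueError("reference set has no AAS at this feature")
    return (hits_a / total_a) / (hits_b / total_b)


def describe(fc):
    """Express a fold change as an enrichment or a depletion."""
    if fc == 0:
        return "complete depletion"
    if fc >= 1:
        return f"{fc:.1f}-fold enrichment"
    return f"{1 / fc:.1f}-fold depletion"


def heat_map_value(fc):
    """log2 scale, so that enrichment and depletion are symmetric about 0."""
    return log2(fc)


# Hypothetical counts (not from the study): 2 of 200 AAS in a disease
# class versus 38 of 200 AAS in the reference set lie at a given site type.
fc = fold_change(2, 200, 38, 200)
print(describe(fc), round(heat_map_value(fc), 2))
```

A log2 scale is a common choice for heat maps of this kind because a 2-fold enrichment and a 2-fold depletion then sit at equal distances from zero.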
The Inherited disease dataset displayed significant differences with respect to the Swiss-Prot neutral dataset in terms of SIFT predictions, structural features (protein stability, secondary structure, solvent accessibility, protein disorder) and functional sites (catalytic residues, sites of phosphorylation, ubiquitination & O-linked glycosylation). When the Disease-associated polymorphism dataset was compared against the Swiss-Prot neutral dataset, the only significant differences identified were those involving the depletion of AAS in intrinsically disordered regions of proteins. Functional polymorphisms were found to be intermediate between the Inherited disease mutations and the Disease-associated polymorphisms, and differed from the Swiss-Prot neutral dataset in terms of SIFT prediction, solvent accessibility, secondary structure, disordered regions, phosphorylation sites and gain of catalytic residues. Disease-associated polymorphisms are often associated with complex traits and it is therefore very likely that they exert subtle effects which, either singly or in combination with other genetic or environmental factors, give rise to a disease state/susceptibility. This contrasts with monogenic disease in which we show [and others (Wang and Moult, 2001; Yue et al., 2005) have previously shown] that disruption of protein stability is the main underlying causative factor. We further postulate that disease-associated polymorphisms are biased towards exerting their influence via the subtle modification of functional sites at exposed residues (~55%) or by modifying functional sites within disordered regions (~20%).
Both the somatic datasets are likely to contain a proportion of ‘passenger’ as opposed to ‘driver’ mutations. Consistent with this expectation, SIFT predicted 47% of Cancer mutations and 54% of Kinase mutations to be deleterious. Although one might intuitively expect there to be fewer passenger mutations in the more focused Kinase dataset [Torkamani & Schork, 2008], in practice the slight excess of deleterious mutations in this dataset was not statistically significant.
The Kinase dataset was significantly depleted in AAS within disordered regions, reflecting the idiosyncrasies of the structure of the proteins in the Kinase dataset. Both the Cancer and Kinase datasets exhibited significant enrichment for radical changes to protein structure via changes of solvent accessibility from buried to exposed, consequent to AAS. The Cancer dataset was found to be significantly enriched for AAS at calmodulin-binding sites, which are short or loosely structured helical segments within otherwise disordered regions and can be seen as being analogous to Molecular Recognition Features (MoRFs) [Mohan et al., 2006]. Since MoRFs exhibit molecular recognition and binding functions [Mohan et al., 2006], the Cancer AAS in these regions are likely to disrupt a wide range of functions in the cell, including signaling and protein–protein interaction sites. We speculate that the ‘drivers’ in the Cancer and Kinase datasets act via radical changes to protein structure, indicated by the significant enrichment of AAS predicted to alter solvent accessibility (Buried>Exposed), whilst ‘drivers’ in Cancer are also likely to exert their effects via the disruption of molecular recognition sites (e.g. protein–protein interaction sites).
The limitations of our study revolve around both the datasets and the prediction tools employed. For the Inherited disease and Disease-associated polymorphism datasets, multiple lines of evidence were used to assign an AAS as being causative of a disease phenotype. Despite the best efforts of the reporting authors and database curators, there is, however, likely to be a proportion of AAS in each dataset that are not actually causative of the associated disease even though they have been reported as being so. This is especially true for the Disease-associated polymorphism dataset, where the majority of AAS have been reported as being causative despite there often being no direct evidence for this assertion (e.g. from functional studies). Therefore, a proportion of the AAS in the Disease-associated polymorphism dataset may simply be in linkage disequilibrium with the actual causative variant(s) rather than being the causative variant(s) themselves. The future use of data derived from emerging functional assays holds out the promise of generating improved disease mutation datasets that can be used to train computational classifiers [Couch et al., 2008].
Both the somatic datasets (Cancer and Kinase) are also problematic in that they are expected not only to contain mutations that lead to neoplastic progression (‘drivers’) but also neutral mutations that have arisen as a consequence of the greatly increased mutation rates in tumour cells but do not directly influence the process of tumorigenesis in any way (‘passengers’) [Greenman et al., 2007].
The in silico tools selected for this study were validated by assessing the accuracy of predictions made for known functional sites (Suppl. Table S4). The accuracy of the predictions used in subsequent analyses should therefore have been high enough to be useful both in generating and in testing the various biological hypotheses put forward.
Our prediction that 70% of Inherited disease AAS and 58% of Functional polymorphisms are located within buried residues is consistent with the view that disruption of protein stability is a key feature of mutations that cause inherited disease. By contrast, Inherited disease AAS predicted to be non-deleterious (‘tolerant’) by SIFT were characterized by significant enrichment for AAS within solvent accessible residues, regions of intrinsic protein disorder, and in association with the loss or gain of various post-translational modifications. Although sequence conservation is a powerful feature for the prediction of deleterious AAS in the Inherited disease dataset, it lacks resolution, especially when examining polymorphic AAS within exposed (solvent accessible) or disordered regions. It is important not to neglect the role that the disruption of functional residues undoubtedly plays in disease pathogenesis, especially for complex disease. Therefore, the incorporation of structural and functional sites as additional features in machine learning algorithms is likely to improve our ability to identify deleterious AAS computationally, especially in the case of polymorphic AAS.
Finally, we have constructed a web resource which can be used for in silico functional profiling. This Feature Server tool can be found at http://profile.mutdb.org. The tool was built using the CakePHP development framework and allows users to submit their own mutations for characterization. Once a protein sequence and one or more amino acid substitutions have been submitted, a script is run that calculates the predicted gain and/or loss of all the bioinformatic features discussed here.
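The server's script is not described in detail here, so the following is only an illustrative sketch of how a gain or loss of a predicted feature might be derived from a submitted sequence and substitution. It assumes that a predictor returns one score per residue and that a site counts as present when its score reaches a threshold. The function names, the threshold of 0.5 and the toy predictor are hypothetical and do not represent the tools used in this study.

```python
import re


def apply_substitution(sequence, aas):
    """Return the mutant sequence for a substitution such as 'A538T'
    (an optional 'p.' prefix is accepted)."""
    match = re.fullmatch(r"(?:p\.)?([A-Z])(\d+)([A-Z])", aas)
    if match is None:
        raise ValueError(f"unrecognised substitution: {aas}")
    wild, pos, mutant = match.group(1), int(match.group(2)), match.group(3)
    # Positions are 1-based; check the stated wild-type residue is present.
    if pos < 1 or pos > len(sequence) or sequence[pos - 1] != wild:
        raise ValueError(f"{aas} does not match the submitted sequence")
    return sequence[:pos - 1] + mutant + sequence[pos:]


def feature_change(predict, sequence, aas, threshold=0.5):
    """Compare per-residue scores before and after the substitution.

    Returns {position: 'gain' | 'loss'} for every residue whose score
    crosses the (hypothetical) threshold.
    """
    before = predict(sequence)
    after = predict(apply_substitution(sequence, aas))
    changes = {}
    for pos, (b, a) in enumerate(zip(before, after), start=1):
        if b < threshold <= a:
            changes[pos] = "gain"
        elif a < threshold <= b:
            changes[pos] = "loss"
    return changes


def toy_predictor(sequence):
    """Toy stand-in for a real predictor: scores S, T and Y as 1.0."""
    return [1.0 if residue in "STY" else 0.0 for residue in sequence]


print(feature_change(toy_predictor, "MKSAT", "S3A"))
```

Comparing scores across the whole sequence, rather than only at the substituted position, allows for the possibility that a substitution alters a predicted site at a neighbouring residue.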
This research was supported by NSF awards DBI-0644017 (PI: Radivojac), K22LM009135 (PI: Mooney), R01LM009722 (PI: Mooney), a grant from IU Biomedical Research Council, Indiana University, the Showalter Trust and the Indiana Genomics Initiative. The Indiana Genomics Initiative (INGEN) is supported in part by the Lilly Endowment.