Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural network based classifier that improves over the state-of-the-art in distinguishing between effect and neutral variants. Our method's improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method's carefully calibrated reliability index informs selection of variants for experimental follow up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and renders our new method as the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web
Delta, input feature that results from computing the difference between the feature scores for the native amino acid and those for the variant amino acid; nsSNP, non-synonymous SNP; PMD, Protein Mutant Database; SNAP, Screening for non-acceptable polymorphisms; SNP, single nucleotide polymorphism; variant, any amino acid changing sequence variant.
functional effect prediction; variant effect; neural network; from sequence; SNP effect
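The reliability index mentioned in the abstract above trades coverage for accuracy: keeping only the most strongly predicted variants raises the fraction that is correct. A minimal sketch of that trade-off follows. The record layout, the integer reliability scale, and the toy data are assumptions for illustration and are not taken from SNAP2 itself.

```python
from typing import List, Tuple

# Hypothetical record: (reliability_index, predicted_label, observed_label).
# The integer reliability scale is an assumption made for this sketch.
Prediction = Tuple[int, str, str]


def accuracy_at_threshold(preds: List[Prediction], min_ri: int) -> Tuple[float, float]:
    """Return (coverage, accuracy) for predictions with reliability >= min_ri."""
    kept = [p for p in preds if p[0] >= min_ri]
    if not kept:
        return 0.0, float("nan")
    correct = sum(1 for _, predicted, observed in kept if predicted == observed)
    return len(kept) / len(preds), correct / len(kept)


# Invented toy predictions, not real SNAP2 output.
toy = [
    (9, "effect", "effect"),
    (8, "effect", "effect"),
    (7, "neutral", "neutral"),
    (3, "effect", "neutral"),
    (2, "neutral", "effect"),
    (1, "neutral", "neutral"),
]

for threshold in (0, 5):
    coverage, accuracy = accuracy_at_threshold(toy, threshold)
    print(f"RI >= {threshold}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Scanning the threshold over the full reliability range gives the accuracy-versus-coverage curve from which an experimentalist could pick a cut-off for follow-up.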
Background & Aims
Genome-wide association studies (GWASs) have identified 140 Crohn’s disease (CD) susceptibility loci. For most loci, the variants that cause disease are not known and the genes affected by these variants have not been identified. We aimed to identify variants that cause CD through detailed sequencing, genetic association, expression, and functional studies.
We sequenced whole exomes of 42 unrelated subjects with CD and 5 healthy individuals (controls), and then filtered single-nucleotide variants by incorporating association results from meta-analyses of CD GWASs and in silico mutation effect prediction algorithms. We then genotyped 9348 patients with CD, 2868 with ulcerative colitis, and 14,567 controls, and analyzed the associated variants in functional studies using materials from patients and controls and in vitro model systems.
We identified rare missense mutations in PR domain-containing 1 (PRDM1) and associated these with CD. These increased proliferation of T cells and secretion of cytokines upon activation, and increased expression of the adhesion molecule L-selectin. A common CD risk allele, identified in GWASs, correlated with reduced expression of PRDM1 in ileal biopsies and peripheral blood mononuclear cells (combined P=1.6×10−8). We identified an association between CD and a common missense variant, Val248Ala, in nuclear domain 10 protein 52 (NDP52) (P=4.83×10−9). We found that this variant impairs the regulatory functions of NDP52 to inhibit NFκB activation of genes that regulate inflammation and affect stability of proteins in toll-like receptor pathways.
We have extended GWAS results and provide evidence that variants in PRDM1 and NDP52 determine susceptibility to CD. PRDM1 maps adjacent to a CD interval identified in GWASs and encodes a transcription factor expressed by T and B cells. NDP52 is an adaptor protein that functions in selective autophagy of intracellular bacteria and signaling molecules, supporting the role for autophagy in pathogenesis of CD.
inflammatory bowel disease; whole-exome sequencing; complex disease
PredictProtein is a meta-service for sequence analysis that has been predicting
structural and functional features of proteins since 1992. Queried with a
protein sequence it returns: multiple sequence alignments, predicted aspects of
structure (secondary structure, solvent accessibility, transmembrane helices
(TMSEG) and strands, coiled-coil regions, disulfide bonds and disordered
regions) and function. The service incorporates analysis methods for the
identification of functional regions (ConSurf), homology-based inference of Gene
Ontology terms (metastudent), comprehensive subcellular localization prediction
(LocTree3), protein–protein binding sites (ISIS2),
protein–polynucleotide binding sites (SomeNA) and predictions of the
effect of point mutations (non-synonymous SNPs) on protein function (SNAP2). Our
goal has always been to develop a system optimized to meet the demands of
experimentalists not highly experienced in bioinformatics. To this end, the
PredictProtein results are presented as both text and a series of intuitive,
interactive and visually appealing figures. The web server and sources are
available at http://ppopen.rostlab.org.
Rat strains differ dramatically in their susceptibility to mammary carcinogenesis. On the assumption that susceptibility genes are conserved across mammalian species and hence inform human carcinogenesis, numerous investigators have used genetic linkage studies in rats to identify genes responsible for differential susceptibility to carcinogenesis. Using a genetic backcross between the resistant Copenhagen (Cop) and susceptible Fischer 344 (F344) strains, we mapped a novel mammary carcinoma susceptibility (Mcs30) locus to the centromeric region on chromosome 12 (LOD score of ∼8.6 at the D12Rat59 marker). The Mcs30 locus comprises approximately 12 Mbp on the long arm of rat RNO12 whose synteny is conserved on human chromosome 13q12 to 13q13. After analyzing numerous genes comprising this locus, we identified Fry, the rat ortholog of the furry gene of Drosophila melanogaster, as a candidate Mcs gene. We cloned and determined the complete nucleotide sequence of the 13 kbp Fry mRNA. Sequence analysis indicated that the Fry gene was highly conserved across evolution, with 90% similarity of the predicted amino acid sequence among eutherian mammals. Comparison of the Fry sequence in the Cop and F344 strains identified two non-synonymous single nucleotide polymorphisms (SNPs), one of which creates a putative, de novo phosphorylation site. Further analysis showed that the expression of the Fry gene is reduced in a majority of rat mammary tumors. Our results also suggested that FRY activity was reduced in human breast carcinoma cell lines as a result of reduced levels or mutation. This study is the first to identify the Fry gene as a candidate Mcs gene. Our data suggest that the SNPs within the Fry gene contribute to the genetic susceptibility of the F344 rat strain to mammary carcinogenesis. These results provide the foundation for analyzing the role of the human FRY gene in cancer susceptibility and progression.
An international consortium released the first draft sequence of the human genome 10 years ago. Although the analysis of this data has suggested the genetic underpinnings of many diseases, we have not yet been able to fully quantify the relationship between genotype and phenotype. Thus, a major current effort of the scientific community focuses on evaluating individual predispositions to specific phenotypic traits given their genetic backgrounds. Many resources aim to identify and annotate the specific genes responsible for the observed phenotypes. Some of these use intra-species genetic variability as a means for better understanding this relationship. In addition, several online resources are now dedicated to collecting single nucleotide variants and other types of variants, and annotating their functional effects and associations with phenotypic traits. This information has enabled researchers to develop bioinformatics tools to analyze the rapidly increasing amount of newly extracted variation data and to predict the effect of uncharacterized variants. In this work, we review the most important developments in the field—the databases and bioinformatics tools that will be of utmost importance in our concerted effort to interpret the human variome.
genomic variation; genome interpretation; genomic variant databases; gene prioritization; deleterious variants
In recent years the number of human genetic variants deposited into the publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. The non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the consistent increase in annotated data is offering the opportunity to improve prediction accuracy.
Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthew's correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation with respect to the best-performing method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor.
Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. Meta-SNP server is freely accessible at http://snps.biofold.org/meta-snp.
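The Meta-SNP results above are reported as overall accuracy and Matthews correlation coefficient (MCC), broken down by how far the four predictors agree. The sketch below shows how these quantities are computed from a confusion matrix, and how a variant could be assigned to an agreement subset. The function names and the toy counts are hypothetical and do not reproduce the Meta-SNP data.

```python
import math
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall two-state accuracy: fraction of all calls that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when a marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def agreement_class(calls: Sequence[bool]) -> str:
    """Classify one variant by predictor agreement (True = disease call)."""
    positives = sum(calls)
    if positives in (0, len(calls)):
        return "unanimous"  # all predictors agree
    if positives * 2 == len(calls):
        return "split"      # half of the predictors disagree with the other half
    return "majority"


# Invented toy counts for illustration only.
print(accuracy(tp=40, tn=40, fp=10, fn=10))
print(mcc(tp=40, tn=40, fp=10, fn=10))
print(agreement_class([True, True, False, False]))
```

Computing the two measures separately on the "unanimous" and "split" subsets mirrors the kind of breakdown reported in the abstract.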
Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cures for many diseases.
Non-synonymous single nucleotide polymorphisms (nsSNPs) alter the protein sequence and can cause disease. The impact has been described by reliable experiments for relatively few mutations. Here, we study predictions for functional impact of disease-annotated mutations from OMIM, PMD and Swiss-Prot and of variants not linked to disease.
Most disease-causing mutations were predicted to impact protein function. More surprisingly, the raw prediction scores for disease-causing mutations were higher than the scores for the function-altering data set originally used for developing the prediction method (here SNAP). We might expect that diseases are caused by change-of-function mutations. However, it is surprising how well prediction methods developed for different purposes identify this link. Conversely, our predictions suggest that the set of nsSNPs not currently linked to diseases contains very few strong disease associations to be discovered.
Firstly, annotations of disease-causing nsSNPs are on average so reliable that they can be used as proxies for functional impact. Secondly, disease-causing nsSNPs can be identified very well by methods that predict the impact of mutations on protein function. This implies that the existing prediction methods provide a very good means of choosing a set of suspect SNPs relevant for disease.
Summary: Many existing databases annotate experimentally characterized single nucleotide polymorphisms (SNPs). Each non-synonymous SNP (nsSNP) changes one amino acid in the gene product (single amino acid substitution; SAAS). This change can either affect protein function or be neutral in that respect. Most polymorphisms lack experimental annotation of their functional impact. Here, we introduce SNPdbe (SNP database of effects), with predictions of computationally annotated functional impacts of SNPs. Database entries represent nsSNPs in dbSNP and 1000 Genomes collection, as well as variants from UniProt and PMD. SAASs come from >2600 organisms; ‘human’ being the most prevalent. The impact of each SAAS on protein function is predicted using the SNAP and SIFT algorithms and augmented with experimentally derived function/structure information and disease associations from PMD, OMIM and UniProt. SNPdbe is consistently updated and easily augmented with new sources of information. The database is available as a MySQL dump and via a web front end that allows searches with any combination of organism names, sequences and mutation IDs.
The discrimination between functionally neutral amino acid substitutions and non-neutral mutations, affecting protein function, is very important for our understanding of diseases. The rapidly growing amounts of experimental data enable the development of computational tools to facilitate the annotation of these substitutions. Here, we describe a Random Forests-based classifier, named Mutation Detector (MuD) that utilizes structural and sequence-derived features to assess the impact of a given substitution on the protein function. In its automatic mode, MuD is comparable to alternative tools in performance. However, the uniqueness of MuD is that user-reported protein-specific structural and functional information can be added at run-time, thereby enhancing the prediction accuracy further. The MuD server, available at http://mud.tau.ac.il, assigns a reliability score to every prediction, thus offering a useful tool for the prioritization of substitutions in proteins with an available 3D structure.
Functionally significant heterozygous mutations in the Melanocortin-4 receptor (MC4R) have been implicated in 2.5% of early onset obesity cases in European cohorts. The role of mutations in this gene in severely obese adults, particularly in smaller North American patient cohorts, has been less convincing. More recently, it has been proposed that mutations in a phylogenetically and physiologically related receptor, the Melanocortin-3 receptor (MC3R), could also be a cause of severe human obesity. The objectives of this study were to determine if mutations impairing the function of MC4R or MC3R were associated with severe obesity in North American adults. We studied MC4R and MC3R mutations detected in a total of 1821 adults (889 severely obese and 932 lean controls) from two cohorts. We systematically and comparatively evaluated the functional consequences of all mutations found in both MC4R and MC3R. The total prevalence of rare MC4R variants in severely obese North American adults was 2.25% (CI95%: 1.44–3.47) compared with 0.64% (CI95%: 0.26–1.43) in lean controls (P < 0.005). After classification of functional consequence, the prevalence of MC4R mutations with functional alterations was significantly greater when compared with controls (P < 0.005). In contrast, the prevalence of rare MC3R variants was not significantly increased in severely obese adults [0.67% (CI95%: 0.27–1.50) versus 0.32% (CI95%: 0.06–0.99)] (P = 0.332). Our results confirm that mutations in MC4R are a significant cause of severe obesity, extending this finding to North American adults. However, our data suggest that MC3R mutations are not associated with severe obesity in this population.
The melanocortin 4 receptor (MC4R) is a G-protein-coupled receptor (GPCR) and a key molecule in the regulation of energy homeostasis. At least 159 substitutions in the coding region of human MC4R (hMC4R) have been described experimentally; over 80 of those occur naturally, and many have been implicated in obesity. However, assessment of the presumably functionally essential residues remains incomplete. Here we have performed a complete in silico mutagenesis analysis to assess the functional essentiality of all possible nonnative point mutants in the entire hMC4R protein (332 residues). We applied SNAP, which is a method for quantifying functional consequences of single amino acid (AA) substitutions, to calculate the effects of all possible substitutions at each position in the hMC4R AA sequence. We compiled a mutability score that reflects the degree to which a particular residue is likely to be functionally important. We performed the same experiment for a paralogue human melanocortin receptor (hMC1R) and a mouse orthologue (mMC4R) in order to compare computational evaluations of highly related sequences. Three results are most salient: 1) our predictions largely agree with the available experimental annotations; 2) this analysis identified several AAs that are likely to be functionally critical, but have not yet been studied experimentally; and 3) the differential analysis of the receptors implicates a number of residues as specifically important to MC4Rs vs. other GPCRs, such as hMC1R.—Bromberg, Y., Overton, J., Vaisse, C., Leibel, R. L., Rost, B. In silico mutagenesis: a case study of the melanocortin 4 receptor.
MC4R; MC1R; SNAP; active functional site; obesity; diabetes
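The complete in silico mutagenesis described above enumerates all 19 non-native substitutions at every position (332 x 19 = 6308 variants for hMC4R) and condenses the predictions into a per-residue score. The sketch below illustrates the enumeration. The `predict_effect` callback is a hypothetical stand-in for a predictor such as SNAP, and the score used here (the fraction of substitutions predicted as non-neutral) is an assumed definition, since the abstract does not spell out how the mutability score was compiled.

```python
from typing import Callable, Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def all_point_mutants(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, native, variant) for every non-native substitution."""
    for position, native in enumerate(sequence, start=1):
        for variant in AMINO_ACIDS:
            if variant != native:
                yield position, native, variant


def mutability_scores(
    sequence: str, predict_effect: Callable[[str, int, str], bool]
) -> List[float]:
    """Per position: fraction of the 19 substitutions predicted as non-neutral.

    This scoring rule is an assumption for illustration, not the published one.
    """
    counts = [0] * len(sequence)
    for position, _native, variant in all_point_mutants(sequence):
        if predict_effect(sequence, position, variant):
            counts[position - 1] += 1
    return [count / 19 for count in counts]


def toy_predictor(sequence: str, position: int, variant: str) -> bool:
    """Placeholder rule, NOT SNAP: flag any change at Cys and any change to Pro."""
    return sequence[position - 1] == "C" or variant == "P"


print(len(list(all_point_mutants("MCA"))))      # 3 positions x 19 substitutions
print(mutability_scores("MCA", toy_predictor))
```

Running the same scan on related sequences (for example a paralogue and an orthologue) and comparing the per-position scores corresponds to the differential analysis the abstract describes.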
Mutations resulting in the disruption of protein function are the underlying causes of many genetic diseases. Some mutations affect the number of expressed proteins while others alter the activity on a per-molecule basis. Single amino acid substitutions as caused by non-synonymous Single Nucleotide Polymorphisms (nsSNPs) often disrupt function by altering protein structure and/or stability, but can also wreak havoc by directly impacting functional binding sites. Given the experimental three-dimensional (3D) structure of a protein, we can try to differentiate between the "effect on structure/stability" and the "effect on binding". However, experimental 3D structures are available for only 1% of all known proteins; the magnitude of stability change caused by a given mutation is more widely available.
Here, we analyze to which extent the functional effect of a mutation can be predicted from the effect on protein stability. We find that simple sequence-based methods succeed in predicting functional effects of nsSNPs. In fact, such methods consistently outperform approaches that predict functional change through the application of binary thresholds to stability change. We also observed that if stability is affected, functional change is easier to predict than when stability is not affected.
Our results confirmed that stability change is somehow related to function change. However, we also show that the knowledge of stability changes in no way suffices to predict functional changes and that many function changing mutations have no effect on stability.
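The baseline discussed above predicts functional change by applying a binary threshold to the stability change. A minimal sketch of such a baseline shows why it misses function-changing mutations that leave stability untouched. The 1.0 kcal/mol cut-off and the toy values are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def predicts_function_change(ddg_kcal_per_mol: float, threshold: float = 1.0) -> bool:
    """Call a variant function-changing if |stability change| reaches the threshold.

    The default threshold is an assumed value, not one taken from the study.
    """
    return abs(ddg_kcal_per_mol) >= threshold


# Invented toy data: (stability change in kcal/mol, observed functional change).
toy: List[Tuple[float, bool]] = [
    (2.3, True),
    (0.1, True),   # changes function without affecting stability
    (-1.5, True),
    (0.2, False),
]

# Function-changing variants that the stability threshold fails to flag.
missed = [
    (ddg, changed)
    for ddg, changed in toy
    if changed and not predicts_function_change(ddg)
]
print(missed)
```

Any variant that alters function through, say, a binding site while leaving stability intact falls below the threshold and is lost, which is the limitation the abstract points to.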
Motivation: Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Results: Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1 kcal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability: ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Summary: Many non-synonymous single nucleotide polymorphisms (nsSNPs) in humans are suspected to impact protein function. Here, we present a publicly available server implementation of the method SNAP (screening for non-acceptable polymorphisms) that predicts the functional effects of single amino acid substitutions. SNAP identifies over 80% of the non-neutral mutations at 77% accuracy and over 76% of the neutral mutations at 80% accuracy at its default threshold. Each prediction is associated with a reliability index that correlates with accuracy and thereby enables experimentalists to zoom into the most promising predictions.
Availability: Web-server: http://www.rostlab.org/services/SNAP; downloadable program available upon request.
Supplementary information: Supplementary data are available at Bioinformatics online.
In 404 Lepob/ob F2 progeny of a C57BL/6J (B6) x DBA/2J (DBA) intercross, we mapped a DBA-related quantitative trait locus (QTL) to distal Chr1 at 169.6 Mb, centered about D1Mit110, for diabetes-related phenotypes that included blood glucose, HbA1c, and pancreatic islet histology. The interval was refined to 1.8 Mb in a series of B6.DBA congenic/subcongenic lines also segregating for Lepob. The phenotypes of B6.DBA congenic mice include reduced β-cell replication rates accompanied by reduced β-cell mass, reduced insulin/glucose ratio in blood, reduced glucose tolerance, and persistent mild hypoinsulinemic hyperglycemia. Nucleotide sequence and expression analysis of 14 genes in this interval identified a predicted gene that we have designated “Lisch-like” (Ll) as the most likely candidate. The gene spans 62.7 kb on Chr1qH2.3, encoding a 10-exon, 646–amino acid polypeptide, homologous to Lsr on Chr7qB1 and to Ildr1 on Chr16qB3. The largest isoform of Ll is predicted to be a transmembrane molecule with an immunoglobulin-like extracellular domain and a serine/threonine-rich intracellular domain that contains a 14-3-3 binding domain. Morpholino knockdown of the zebrafish paralog of Ll resulted in a generalized delay in endodermal development in the gut region and dispersion of insulin-positive cells. Mice segregating for an ENU-induced null allele of Ll have phenotypes comparable to the B.D congenic lines. The human ortholog, C1orf32, is in the middle of a 30-Mb region of Chr1q23-25 that has been repeatedly associated with type 2 diabetes.
Type 2 diabetes (T2D) accounts for over 90% of instances of diabetes and is a leading cause of medical morbidity and mortality. Twin studies indicate a strong polygenic contribution to susceptibility within the context of obesity. Although approximately ten genes making important contributions to individual risk have been identified, it is clear that others remain to be identified. In this study, we intercrossed obese, diabetes-resistant and diabetes-prone mouse strains to implicate a genetic interval on mouse Chr1 associated with reduced β-cell numbers and elevated blood glucose. We narrowed the region using molecular genetics and computational approaches to identify a novel gene we designated “Lisch-like” (Ll). The orthologous human genetic interval has been repeatedly implicated in T2D. Mice with an induced mutation that reduces Ll expression are impaired in both β-cell development and glucose metabolism, and reduced expression of the homologous gene in zebrafish disrupts islet development. Ll is expressed in organs implicated in the pathophysiology of T2D (hypothalamus, islets, liver, and skeletal muscle) and is predicted to encode a transmembrane protein that could mediate cholesterol transport and/or convey signals related to cell division. Either mechanism could mediate effects on β-cell mass that would predispose to T2D.
Many genetic variations are single nucleotide polymorphisms (SNPs). Non-synonymous SNPs are ‘neutral’ if the resulting point-mutated protein is not functionally discernible from the wild type and ‘non-neutral’ otherwise. The ability to identify non-neutral substitutions could significantly aid in targeting disease-causing detrimental mutations, as well as SNPs that increase the fitness of particular phenotypes. Here, we introduced comprehensive data sets to assess the performance of methods that predict SNP effects. Alongside, we introduced SNAP (screening for non-acceptable polymorphisms), a neural network-based method for the prediction of the functional effects of non-synonymous SNPs. SNAP needs only sequence information as input, but benefits from functional and structural annotations, if available. In a cross-validation test on over 80 000 mutants, SNAP identified 80% of the non-neutral substitutions at 77% accuracy and 76% of the neutral substitutions at 80% accuracy. This constituted an important improvement over other methods; the improvement rose to over ten percentage points for mutants for which existing methods disagreed. Possibly even more importantly, SNAP introduced a well-calibrated measure for the reliability of each prediction. This measure will allow users to focus on the most accurate predictions and/or the most severe effects. Available at http://www.rostlab.org/services/SNAP
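Statements such as "identified 80% of the non-neutral substitutions at 77% accuracy" pair two per-class measures. One reading, which is an interpretation and not something the abstract states, is that the first number is coverage (the fraction of observed members of a class that are found) and the second is class-specific accuracy (the fraction of predictions for that class that are correct). The sketch below computes both from a two-state confusion matrix; the counts are invented toy values and are not the SNAP evaluation data.

```python
from typing import Dict


def per_class_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, Dict[str, float]]:
    """Coverage and class-specific accuracy for both classes.

    tp/fn: observed non-neutral variants predicted non-neutral / neutral.
    fp/tn: observed neutral variants predicted non-neutral / neutral.
    """
    return {
        "non_neutral": {
            "coverage": tp / (tp + fn),  # found among observed non-neutral
            "accuracy": tp / (tp + fp),  # correct among predicted non-neutral
        },
        "neutral": {
            "coverage": tn / (tn + fp),  # found among observed neutral
            "accuracy": tn / (tn + fn),  # correct among predicted neutral
        },
    }


# Invented toy counts for illustration only.
metrics = per_class_metrics(tp=80, fn=20, fp=24, tn=76)
for label, values in metrics.items():
    print(label, {name: round(value, 3) for name, value in values.items()})
```

Moving the decision threshold shifts variants between the two predicted classes, so coverage of one class is gained at the cost of its accuracy.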