Although mutation analysis is a key part of making a definitive diagnosis of a genetic disease, interpreting the biological implications of the identified variations by integrating various lines of archived information about the genes in question remains a time-consuming step. To expedite this evaluation step for disease-causing genetic variations, we developed Mutation@A Glance (http://rapid.rcai.riken.jp/mutation/), a highly integrated web-based tool for analysing human disease mutations; it implements a user-friendly graphical interface to visualize about 40 000 known disease-associated mutations and genetic polymorphisms from more than 2600 protein-coding human disease-causing genes. Mutation@A Glance locates known genetic variation data individually on the nucleotide and amino acid sequences and makes it possible to cross-reference them with tertiary and/or quaternary protein structures and various functional features associated with specific amino acid residues in the proteins. We showed that disease-associated missense mutations had a stronger tendency to reside in positions relevant to the structure/function of proteins than neutral genetic variations. From a practical viewpoint, Mutation@A Glance can function as a ‘one-stop’ analysis platform for newly determined DNA sequences, enabling users to readily identify and evaluate new genetic variations by integrating multiple lines of information about the disease-causing candidate genes.
Genetic diseases are caused by structural changes in genes and/or chromosomes. In the Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim) database, more than 2200 genes are known to have mutations causing genetic diseases.1 For instance, primary immunodeficiency diseases (PIDs) are caused by congenital defects in genes involved in the development and maintenance of the immune system,2,3 and they can be diagnosed using mutation analysis that identifies pathogenic mutations in candidate PID genes. This process plays a critical role in improving the quality of life for PID patients.4 In this regard, recent advances in DNA sequencing technology will greatly expedite this process. Thus, the next bottleneck to be addressed is how to clarify the associations between newly identified patient-specific genetic variations and disease phenotypes, even when familial disease history is absent. To eliminate this bottleneck in mutation analysis, we need a bioinformatics tool that enables us to readily evaluate the impact of a genetic variation on the structure/function of a gene product at the molecular level. Towards this end, our first step was to develop an integrated ‘one-stop’ analysis platform where we could cross-reference multiple lines of information regarding known genetic variations in genes of interest, including the large number of non-synonymous (ns) single-nucleotide polymorphisms (nsSNPs) found in healthy individuals.5–7
Bioinformatics resources and methods played an indispensable role in creating this platform.8–12 Although a number of databases of reported human disease mutations and SNPs have already been constructed,13–25 these databases were launched as static archives of genetic variation data, not necessarily as interactive tools for evaluating newly identified sequence variation data. Several computational algorithms for predicting the effects of ns substitutions on the corresponding protein have been developed using evolutionary and protein three-dimensional (3D) structure information.26–31 However, despite the public availability of these software packages and web servers, there are at least two hurdles, especially for clinical researchers, in exploiting them for mutation analysis: (i) since these servers usually require the position of the genetic variation in a submitted sequence as a query input, the users have to specify the variation position in the sequence before submitting the query; (ii) since these servers do not necessarily incorporate known disease-associated mutation data into their systems, the users have to manually compare their newly identified genetic variation data from patients with previously reported data. Thus, we thought it was important to integrate predictive bioinformatics tools, such as those described above, with a comprehensive set of known genetic variation data, to create a ‘one-stop’ mutation analysis platform.32
In this context, we present Mutation@A Glance (http://rapid.rcai.riken.jp/mutation/), a new web-based integrated bioinformatics tool for analysing mutations from human genetic diseases. The user-friendly graphical interface of Mutation@A Glance makes it possible to map known disease-associated mutation data onto the nucleotide and amino acid sequences of a gene of interest, and to link these mutation data to the 3D structure of the gene product along with various lines of information about the mutated amino acid residues (e.g. the extent of evolutionary sequence conservation, post-translational modifications and molecular interactions). Furthermore, this tool enables users to identify and evaluate new sequence variations in a query DNA sequence from a gene of interest by comparing them with known disease-associated mutation data and by using the SIFT program,26 which is one of the most accurate and widely used programs for predicting the effects of ns substitutions based on evolutionary information for each residue position.33 Mutation@A Glance therefore serves as a ‘one-stop’ informational platform to identify and evaluate new genetic variations by integrating multiple lines of information about the disease-causing candidate genes.
Human disease-associated mutation data were obtained from the following three databases: OMIM (http://www.ncbi.nlm.nih.gov/omim/),1 UniProt (http://www.uniprot.org/)34 and RAPID (http://rapid.rcai.riken.jp/).17 Sequence variations that were associated with OMIM in the dbSNP database (Build 130, http://www.ncbi.nlm.nih.gov/projects/SNP/)18 were considered to be disease-associated mutations and other variations were considered non-disease associated. For the mutation data in the UniProt database, VARIANT features associated with diseases in the human entries were considered. RAPID is a molecular database that we have recently established for reported disease mutation data in genes causing PIDs.17 The RAPID database is directly connected to our local server and the mutation data (as of August 2009) are retrieved using a Perl script. The human genome sequence (Build 36.3) and the human RefSeq nucleotide and protein sequences were downloaded from the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/). Information regarding residue-wise functional features (transmembrane helix, signal peptide, nucleotide binding, disulphide bond, metal binding, active site and post-translational modification site) was extracted from the human entries in the UniProt database. Information regarding the exon–intron structures of each gene was downloaded from the NCBI ftp site.
Sequences from other organisms homologous to the human proteins encoded by disease-causing genes were identified using the BLAST program35 against the RefSeq database (6 691 817 amino acid sequences) with a cut-off E-value of 10^-4. If the sequence identity and the coverage between a sequence hit and the human sequence were higher than 40% and 80%, respectively, the hit was selected as a homologous sequence. When two or more sequences from an organism were found as homologous sequences, only the sequence with the highest sequence identity was considered. The homologous protein sequences from various organisms were aligned using the CLUSTAL W program.36 The degree of sequence conservation at each amino acid position in the multiple sequence alignment (simply designated as ‘residue conservation’ in Fig. 1) was defined as the ratio of (the number of homologous protein sequences carrying the same amino acid residue as the human sequence) to (the number of aligned homologous protein sequences) at the specified position in the multiple sequence alignment. For example, if Ala appears at an aligned position in the human sequence and the corresponding positions in all of the other homologous sequences are also Ala, the residue conservation at this position is 1.0. The frequency distribution of the residue conservations at disease-associated missense mutation or nsSNP positions for the proteins analysed in this study was represented using bins with an interval of 0.2. The value in each bin was normalized by the frequency of the total number of residues in each bin.
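The residue conservation defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: the function names, the use of '-' as the gap character, the choice to count a gap in a homologous sequence as a mismatch, and the treatment of the bin edges are all assumptions made here.

```python
def residue_conservation(human_seq, homolog_seqs):
    """Fraction of homologs carrying the human residue, per human position.

    All sequences are rows of one multiple alignment (equal length).
    Assumption: '-' marks a gap, gapped homologs count as mismatches,
    and alignment columns where the human row is gapped are skipped.
    """
    if not homolog_seqs:
        raise ValueError("need at least one homologous sequence")
    if any(len(s) != len(human_seq) for s in homolog_seqs):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for col, human_res in enumerate(human_seq):
        if human_res == "-":
            continue
        matches = sum(1 for s in homolog_seqs if s[col] == human_res)
        scores.append(matches / len(homolog_seqs))
    return scores


def conservation_bin(score, n_bins=5):
    """Index of the bin holding a score in [0, 1]; five bins give width 0.2.

    Assumption: bins are closed on the left, and 1.0 falls in the last bin.
    """
    return min(int(score * n_bins), n_bins - 1)


# Toy alignment: the human row followed by three hypothetical homologs.
scores = residue_conservation("AKL", ["AKL", "ARL", "A-L"])
bins = [conservation_bin(s) for s in scores]
```

In the toy alignment, the first and third positions are fully conserved and fall in the highest bin, while the second position matches in only one of three homologs.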
Comparison of frequency distributions of residue conservations in disease-associated missense mutations and nsSNPs. The vertical axis depicts the log-odds ratio of the frequency of ns substitution residue positions (disease-associated mutations or nsSNPs) ...
Protein 3D structure data were downloaded from the Protein Data Bank (PDB, http://www.rcsb.org/pdb/).37 In cases where the 3D structure of a human protein had not yet been determined, we searched the sequences of the PDB entries for a template structure for homology modelling using the BLAST program as described above. When the alignment of the human protein sequence and a known 3D structure showed >30% identity and >90% coverage, a homology model was built using the MODELLER package.38 For each target, 20 model structures were generated and their reliabilities were assessed with the Discrete Optimized Protein Energy (DOPE) method.39 The model with the best DOPE score was selected as the final model for each protein. Information about protein quaternary structures was also extracted from the PDB database. PDB entries that contained information about the biological unit structure and whose polypeptide chains showed >85% identity with a human protein sequence were considered. When an atom of a residue in a given polypeptide chain was <5.0 Å from an atom of a residue in another polypeptide or nucleotide chain, the residue was considered to be located at a molecular interaction interface.
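The interface criterion above (an atom within 5.0 Å of an atom in another chain) can be illustrated with a brute-force distance check. The sketch below is only an illustration under stated assumptions: the dictionary layout of the chains and the function name are hypothetical, and parsing of PDB files is left out.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Residues of chain_a with any atom closer than `cutoff` (in Å)
    to any atom of chain_b.

    Each chain is a dict mapping a residue identifier to a list of
    (x, y, z) atom coordinates; this layout is an assumption of the sketch.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    hits = set()
    for res_id, atoms in chain_a.items():
        if any(math.dist(a, b) < cutoff for a in atoms for b in atoms_b):
            hits.add(res_id)
    return hits


# Toy coordinates: residue 1 lies 3 Å from chain B, residue 2 lies 17 Å away.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(20.0, 0.0, 0.0)]}
chain_b = {10: [(3.0, 0.0, 0.0)]}
interface = interface_residues(chain_a, chain_b)
```

The all-against-all comparison is quadratic in the number of atoms; for large complexes a spatial index would be the usual choice, but the criterion itself is unchanged.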
The solvent accessibilities of the amino acid residues in a 3D modelled structure were calculated using a modification of the Shrake and Rupley method,40 with a water molecule represented by a sphere of 1.4 Å radius. The solvent accessibility is represented by values ranging from 0 to 1. A residue was considered to be exposed on the protein surface if its solvent accessibility was >0.25, and buried otherwise.
We used the DISOPRED2 program41 to analyse each amino acid sequence of a gene product and predict intrinsically unstructured (disordered) regions in the protein sequence. If the program predicted a region consisting of more than three amino acid residues in a sequence to be ‘disordered’, we assigned this region as an intrinsically unstructured one.
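The length filter described above amounts to keeping only runs of more than three consecutive residues predicted as disordered. A minimal sketch follows, assuming the per-residue predictions have already been converted to a list of booleans; the function name and the input format are hypothetical and do not reflect the DISOPRED2 output format.

```python
def disordered_regions(flags, min_len=4):
    """Return 1-based inclusive (start, end) spans of consecutive True flags
    whose length is at least `min_len` (more than three residues by default).
    """
    regions = []
    start = None
    for i, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            if i - start >= min_len:       # run covered start..i-1
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(flags) - start + 1 >= min_len:
        regions.append((start, len(flags)))
    return regions


# 'D' marks a residue predicted as disordered in this toy string.
flags = [c == "D" for c in "DDD.DDDD..DDDDD"]
regions = disordered_regions(flags)
```

In the toy input, the leading run of three residues is discarded, while the runs of four and five residues are kept.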
The effects of ns substitutions on a given protein were evaluated on a local server using the SIFT program26 which predicts the effects of missense substitutions on a protein based on evolutionary information from homologous protein sequences.
From three data resources for human disease mutations, OMIM, UniProt and RAPID, we obtained 25 616 disease-associated mutations and 21 199 nsSNPs in 2656 human genes (Table 1) and integrated them into the local database. Functional classification of the proteins encoded by the disease-associated genes showed a wide variety of molecular functions, such as metabolic enzymes, protein kinases, transcription factors/regulators and structural proteins (Table 1 and Supplementary Table S1). Because we have been actively analysing mutations found in patients with PIDs together with paediatricians in Japan, we constructed RAPID and used it as our original data resource for genetic variations in genes responsible for PIDs.17 RAPID contains manually curated mutation data from the published literature, including nonsense (582 sites in 96 genes), frame-shift (851 sites in 101 genes) and insertion/deletion (85 sites in 42 genes) mutations as well as missense mutations (1564 sites in 116 genes) in the protein-coding regions of 155 PID genes (as of August 2009). For non-PID genes, we used two publicly available data sets from UniProt and OMIM. The UniProt database contains only missense mutation data (22 258 entries in 2614 genes). The OMIM database, on the other hand, contains a large number of missense mutations (1899 entries in 556 genes) and a relatively small number of other types of mutations (99 entries in 13 genes). The RAPID and OMIM databases also contain 699 disease-associated mutations in intronic regions of 147 genes that cause splicing anomalies. Thus, the most frequent mutation type in our data sets was the missense mutation (89% of all entries), as reported in a previous study.13
In general, disease-associated missense mutations tend to occur at evolutionarily conserved positions, because these positions are usually essential for the structure and/or function of a protein.26,42,43 To verify this using the updated data set, we compared the frequency of disease-associated missense mutation sites (19 128 unique positions in 2622 genes) in each residue conservation bin with that of nsSNP sites (20 605 positions in 2494 genes) (Fig. 1). The results indicated that the previously reported tendency still held for the 2622 genes in our data set; the disease-associated mutation sites appeared preferentially in the highest residue conservation bin, while nsSNP sites showed the opposite trend (Fig. 1). Next, we cross-referenced the amino acid positions of the disease-associated missense mutations and nsSNPs to the functional features and 3D structures of the protein data in Mutation@A Glance. We classified these positions in terms of their functional features in a protein (annotated in the UniProt database; Table 2). More disease-associated missense mutations than nsSNPs were found at positions annotated with some functional feature, except for ‘signal peptides’ and ‘post-translational modification sites’. Using a homology modelling technique, we mapped 10 939 out of 19 128 disease-associated mutation sites (57.2%) to protein 3D structures (Fig. 2). Of these sites, 6616 sites (60.4%) were located in regions buried in protein structures (solvent accessibility <0.25). In the same way, 7106 out of 20 605 nsSNP sites (34.4%) were mapped to 3D structures, and 4258 sites (59.9%) were located on the surfaces of proteins (Fig. 2A). This observation is basically consistent with previous findings from structural analysis.44–46 Interestingly, nsSNP sites were located in regions predicted as intrinsically disordered at a frequency three times higher than that of disease-associated mutation sites (Fig. 2A). 
This might be ascribed to the observation that conservation in the intrinsically disordered regions is relatively lower than that in ordered regions.47
Classification of disease-associated mutations and nsSNPs according to their location on protein 3D structure. (A) The numbers in the pie charts depict those of ns substitution positions. (B) Proportion of ns substitution positions in the disease-associated ...
In many cases, proteins function together with other molecules in molecular networks (e.g. signalling pathways). Hence, the effects of mutations on molecular interactions are of particular interest in mutation analysis.48 We thus analysed whether or not the missense mutation positions were located at molecular interaction sites, based on the quaternary protein structures available from the PDB. We found that 714 out of 1738 disease-associated mutation sites (41.1%) were located at the interfaces of 474 distinct proteins known to be involved in protein complex structures (Fig. 2B; see Section 2.3). In contrast, the same was true for only 346 out of 1128 nsSNP sites (30.7%) in 447 genes. We confirmed by the χ² test that the frequency of disease-associated mutation sites located at the molecular interaction interface was significantly higher than that of nsSNP sites (P < 0.01). These results suggest that ns substitutions at positions involved in molecular interactions tend to be disease-related, as we expected.
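The comparison above can be reproduced from the reported counts as a 2 × 2 contingency table (interface versus non-interface sites, for disease-associated mutations versus nsSNPs). The sketch below uses the standard Pearson χ² statistic with one degree of freedom and no continuity correction; whether a correction was applied in the original analysis is not stated, so this choice is an assumption, as is the function name.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (statistic, p_value). For one degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the text: interface sites versus the remaining sites.
disease_interface, disease_total = 714, 1738
snp_interface, snp_total = 346, 1128

chi2, p_value = chi_square_2x2(
    disease_interface, disease_total - disease_interface,
    snp_interface, snp_total - snp_interface,
)
print(f"chi2 = {chi2:.2f}, P = {p_value:.2e}")
```

With these counts the statistic lies well above 6.635, the critical value for P = 0.01 at one degree of freedom, which agrees with the significance level reported above.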
Figure 3 shows the front page of the Mutation@A Glance website. It has two types of query forms, one for visualizing known disease mutation data (Fig. 3A) and one for evaluating novel genetic variations in query DNA sequences (Fig. 3B). For the visualization, a user enters the gene symbol of interest in the form. As the user types, a list of gene names containing the input character string is shown to assist the input. In addition, a user can search for the gene name of interest in the entire list of genes available in Mutation@A Glance, which is displayed by clicking the ‘Select from List’ button (Fig. 3A). For the user's information, the mutation data set used for each gene is noted near the ‘Select from List’ button. Figure 4 shows sample screenshots for the STAT3 gene, which is known to cause hyper-IgE syndrome (HIES).49,50 At the DNA level, the positions of the disease-associated mutations, including substitutions, insertions and deletions, as well as SNPs are shown on a set of exon sequences or on the genomic DNA sequence with/without the open-reading frame for the gene of interest (Fig. 4A). If two or more alternative transcripts exist in the RefSeq database, the genetic variation data are mapped onto the reference sequence that encodes the longest amino acid sequence among the alternative transcripts, whereas all the alternative transcripts are indicated in the top panel of the genomic structure. At the protein level, the disease-associated mutation and SNP sites are highlighted in the primary structure of the gene product along with the available functional annotation of the amino acid residues from the UniProt database (e.g. enzymatic active sites and post-translational modification sites) (Fig. 4B). Information regarding conserved domains from Pfam (http://pfam.sanger.ac.uk/)51 and predicted intrinsically disordered regions is also displayed. 
When 3D structure information for the protein is available, the positions of mutation and SNP data can be viewed on the monomer or complex 3D structures with the Jmol applet (Fig. 4B). Detailed information about a nucleotide or amino acid residue of interest is displayed in another window after clicking on the residue (Fig. 4C). In particular, at the protein level, an amino acid residue is highlighted in the 3D structure when it is clicked (Fig. 4B). The human amino acid sequence can be compared with those of other organisms by clicking the ‘Multiple Alignment’ button (Fig. 4D). The representation of the 3D structure can be selected from two model types (ribbon or space-filling models) and three colouring types (by rainbow, highlighting mutation positions or residue conservation) (Fig. 4E). The ‘External Links’ button provides links to NCBI Entrez Gene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene)52 for general information regarding the gene, Human Protein Reference Database (http://www.hprd.org/)53 for information about the gene product, GeneCards (http://www.genecards.org/),54 the Reference Database of Immune Cells (http://refdic.rcai.riken.jp/)55 for gene expression profiling data and the KEGG pathway (http://www.genome.jp/)56 for pathways involving this gene (Fig. 4F). Using this visualization facility, mapping the amino acid positions of known ns substitutions onto the crystal structure of the STAT3–DNA complex (PDB code: 1bg1)57 revealed that the disease-associated missense mutation positions were spatially located at the interface of the homodimer or at the DNA binding site, whereas the nsSNP positions were located on the surface outside the molecular interaction sites (Fig. 5). 
This suggests that disease-causative missense mutations in STAT3 directly affect the protein–protein and/or protein–DNA interaction, as reported previously.49,50 This is a good demonstration of how Mutation@A Glance can help us interpret mutation effects at the molecular level.
The front page of Mutation@A Glance. There are two types of query interface for (A) browsing known mutation data and (B) evaluating novel sequence variations in DNA sequences of interest. See the main text for details of the mutation data available in ...
Screenshots of Mutation@A Glance. An example of visualizing mutation data for STAT3 is shown at the DNA (A) and the protein levels (B). The nucleotide and amino acid positions of disease-associated mutations and SNPs are coloured magenta and green, respectively. ...
Spatial localization of disease-associated missense mutation sites on the STAT3 protein structure. Two STAT3 subunits are represented as a space-filling model coloured white (subunit A) and a ribbon model coloured pink (subunit B), respectively. A double-stranded ...
One of the issues in the diagnosis of genetic diseases is how to evaluate the pathogenicity of newly identified sequence variations. To address this issue, Mutation@A Glance provides, as its second query form, an interface that allows clinical researchers to assess the impact of an observed sequence variation in a given DNA sequence of a candidate disease-causing gene (Fig. 3B). When a user submits DNA sequences of a candidate gene, the tool returns a list of the sequence variations found in the input DNA sequences at both the DNA and the protein levels (Fig. 6). To identify genetic variations in the input DNA sequences of a given gene, the BLAT program58 is used to align the input DNA sequences with the reference genomic DNA sequence of the corresponding gene. Figure 6A shows the alignment status of the query sequence against the reference sequence. If a sequence variation is found, multiple lines of detailed information about the variation, such as the variation type (e.g. substitution, insertion and deletion), the mutated region (e.g. exon, intron and 5′- or 3′-splice sites constituting the GT-AG rule), the amino acid change (e.g. missense, nonsense, insertion/deletion and frame-shift), the known variation data (disease-associated mutation and SNP) and the structure/function features of the position at the protein level, are displayed based on the reference human genome sequence in the public database (Fig. 6B). Sequence alignments between the query and reference sequences are also displayed (Fig. 6C). If an ns substitution is found in the query DNA sequence, it is evaluated by the SIFT program26 (incorporated in the local system), which predicts whether an amino acid substitution in a protein will be ‘Deleterious’ or ‘Tolerated’ using evolutionary information from the homologous proteins (Fig. 6B). 
We tested the prediction accuracy of SIFT with our data sets of disease-associated mutations and non-disease-associated nsSNPs, and found that the false-negative rate (disease-associated mutations falsely predicted as ‘Tolerated’) and the false-positive rate (nsSNPs falsely predicted as ‘Deleterious’) were 25% and 39%, respectively. These values were comparable to those reported in a previous study.33 The current version of Mutation@A Glance does not implement a method for quantitative evaluation of mutation effects on RNA splicing, mainly because we consider the available evaluation methods not yet mature enough. However, because the effects of mutations on RNA splicing/stability are of great interest, we will place a high priority on implementing an evaluation tool for genetic variations affecting RNA splicing/stability in future development.
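The amino acid change categories listed above for single-nucleotide substitutions (missense and nonsense, as opposed to synonymous changes) follow directly from the standard genetic code. The sketch below illustrates that classification for one codon; it is not the implementation used in Mutation@A Glance, the function name is hypothetical, and the ‘stop-loss’ label for a substitution that removes a stop codon is a category added here for completeness.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a substitution within one codon of the coding sequence."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # a premature stop codon is introduced
    if ref_aa == "*":
        return "stop-loss"     # label assumed here, not used in the text
    return "missense"


examples = {
    ("GCT", "GCC"): classify_codon_change("GCT", "GCC"),  # Ala -> Ala
    ("GCT", "GTT"): classify_codon_change("GCT", "GTT"),  # Ala -> Val
    ("TGG", "TGA"): classify_codon_change("TGG", "TGA"),  # Trp -> stop
}
```

Insertions and deletions need a separate check: a length change that is not a multiple of three shifts the reading frame, which is why they are reported as their own categories.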
An example of evaluating sequence variations in query STAT3 DNA sequences. (A) The mapping status of each query sequence to the reference sequence is shown. (B) If a variation is found in the query sequence, the detailed information is shown for each ...
There are several advantages of Mutation@A Glance over other existing web servers for evaluating the effects of mutations. First, users are only required to have DNA sequences from a particular gene as their input and thus do not need to pre-process their submission data; other websites for evaluating the mutation effects require a list of genetic variations as a query, not raw sequence data.26–31 Secondly, Mutation@A Glance identifies and addresses multiple types of sequence variations (e.g. insertion/deletion, frame-shifts) from input query DNA sequences whereas the other web servers do not. Thirdly, newly identified genetic variations can be easily compared with known mutation and SNP data using the graphical visualization interface of Mutation@A Glance (Fig. 6D).
From the viewpoint of clinical use, no mutation analysis platform can be useful without reliable mutation data sets. However, although large amounts of disease-associated mutation data for various genetic diseases have been reported, most of them are dispersed and stored locally. Only a few websites, e.g. OMIM and UniProt, integrate disease-associated mutation data and allow their contents to be downloaded, and the mutation data in such databases are relatively limited in terms of updating and coverage. Thus, we have begun to comprehensively collect and manually curate disease-associated mutation data from the published literature, focusing on PIDs, and have established a resource for PID research and clinical use named RAPID.17 Mutation@A Glance uses these manually curated data sets for over 150 PID genes in the RAPID database, which are solid enough for clinical use, at least for PID analysis. To make Mutation@A Glance a reliable and general mutation analysis platform for other genetic diseases in the future, we consider that data sharing with experts in particular diseases will be highly important, as in the case of PID; otherwise it would take a long time to accumulate mutation data for all human disease genes to a level acceptable for clinical use. In fact, similar efforts in this direction are being made by the research community.19
As new technologies for determining genetic variation in humans (such as next-generation DNA sequencing) continue to emerge rapidly, the amount of human genetic variation data is growing exponentially.6,7,59 Therefore, we will continue to update and improve the Mutation@A Glance system in order to cope with larger-scale data analysis for more comprehensive identification of disease-causative candidate genes. For this purpose, it would be convenient to add API programs to Mutation@A Glance that allow query submission and result retrieval through command-line scripts.
In summary, Mutation@A Glance is a highly integrated bioinformatics tool for mutation analysis that not only facilitates visualization of sequence variation data along with various types of information, including the primary and tertiary structures of the gene products, but also evaluates the effects of novel sequence variations in a query DNA sequence. The tool runs entirely in a web browser over the Internet and is open to the public. Hence, Mutation@A Glance can be used as a ‘one-stop’ integrated bioinformatics platform for analysing genotype–phenotype relationships of genetic diseases from molecular as well as clinical perspectives.
This work is supported by a Grant-in-aid for Special Coordination Funds for Promoting Science and Technology, from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
The authors would like to thank Drs Shigeaki Nonoyama, Kohsuke Imai, Hirokazu Kanegane, Toshio Miyawaki, Koichi Oshima, Fumihiko Ishikawa and Reiko Kikuno-Fukaya for their critical suggestions about this work.
Edited by Katsumi Isono