Understanding how genetic variation affects the molecular function of gene products is an emergent area of bioinformatic research. Here, we present updates to MutDB (http://www.mutdb.org), a tool aiming to aid bioinformatic studies by integrating publicly available databases of human genetic variation with molecular features and clinical phenotype data. MutDB, first developed in 2002, integrates annotated SNPs in dbSNP and amino acid substitutions in Swiss-Prot with protein structural information, links to scores that predict functional disruption and other useful annotations. Though these functional annotations are mainly focused on nonsynonymous SNPs, some information on other SNP types included in dbSNP is also provided. Additionally, we have developed a new functionality that facilitates KEGG pathway visualization of genes containing SNPs and a SNP query tool for visualizing and exporting sets of SNPs that share selected features based on certain filters.
Understanding how coding single nucleotide polymorphisms (cSNPs) and disease-associated mutations cause molecular alterations and expression changes in gene products is important to many fields of biological and medical research (1,2). We believe that linking disease with basic research data will enable hypothesis generation that can be experimentally tested in the laboratory with functional assays.
Recently, several servers and databases aiming to understand the biochemical effects of nonsynonymous SNPs and disease-associated mutations have been developed. These include SIFT (3), PolyPhen (4), SNPs3D (5), PANTHER (6), PMUT (7), LS-SNP (8), PolyDoms (9) and SNPEffect (10). These methods and their resulting datasets generally apply DNA and protein sequence, protein structure and/or evolutionary features to classify a query amino acid substitution using a training set of putative neutral and causative amino acid substitutions (4,5,8,11–17).
Similarly, MutDB (18,19) is an online resource that serves as a step toward better understanding the potential molecular effects of a mutation. MutDB integrates genetic variation from two public databases, Swiss-Prot (20) and dbSNP (21), and annotates the variants with biochemically relevant information. These two databases were chosen because they are freely available and represent a significant breadth of available amino acid substitutions. However, neither database annotates disease-causing amino acid substitutions particularly well. dbSNP contains few links to OMIM (22), and Swiss-Prot does not distinguish disease-causing amino acid substitutions from other amino acid substitutions. A researcher studying a specific disease should therefore have some prior knowledge of the proteins and mutations of interest. To help, MutDB provides links to databases with disease and phenotype annotations, such as OMIM (22) and dbGaP (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap).
The SNP and mutation data are parsed directly from Swiss-Prot (currently build 51.0) and dbSNP (currently build 126) without curation, other than to remove any annotations that do not map to the wild-type amino acid in the referenced sequence. The gene model provided by MutDB is organized using gene information extracted from a local copy of the UCSC Human Genome Annotation Database (ver. hg18, http://genome.ucsc.edu/) (25). We also use Ensembl (ver. 41_36c, http://www.ensembl.org/) (26) for some gene name cross-references. We attempt to keep pages organized by Entrez Gene ID, with the most representative transcript as the primary gene page. Other known mRNA transcripts annotated in the UCSC Genome Annotation Database are listed at the bottom of the page with their annotations. The data may be browsed alphabetically by gene symbol or searched by keyword, gene symbol, protein or RefSeq ID, or individual variant identifier. Each gene has its own page, which lists related SNPs and mutations classified by their effects on the protein product and shows a pictorial representation of the sequence, including points of conservation, location of exons and location of variants. Links to the corresponding Swiss-Prot and dbSNP pages, a short description of the gene, and the related chromosome name are also supplied.
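As an illustration only, the consistency check described above (dropping annotations whose stated wild-type residue does not match the referenced sequence) might look like the following Python sketch. The `Substitution` type, the function names and the toy sequence are hypothetical and are not taken from the MutDB code base.

```python
from typing import Iterable, List, NamedTuple


class Substitution(NamedTuple):
    """A hypothetical record for one annotated amino acid substitution."""
    position: int   # 1-based residue position in the reference protein
    wild_type: str  # one-letter code of the annotated wild-type residue
    mutant: str     # one-letter code of the substituted residue


def matches_reference(sub: Substitution, reference: str) -> bool:
    """Return True if the annotated wild-type residue is found at that position."""
    index = sub.position - 1  # convert to a 0-based string index
    return 0 <= index < len(reference) and reference[index] == sub.wild_type


def filter_consistent(subs: Iterable[Substitution], reference: str) -> List[Substitution]:
    """Keep only the substitutions that agree with the reference sequence."""
    return [s for s in subs if matches_reference(s, reference)]


if __name__ == "__main__":
    reference = "MKTAYIAK"  # toy sequence, not a real protein
    annotated = [
        Substitution(1, "M", "V"),   # agrees with the reference
        Substitution(4, "G", "D"),   # reference has A at position 4
        Substitution(20, "K", "R"),  # position beyond the sequence end
    ]
    for sub in filter_consistent(annotated, reference):
        print(sub)
```

Annotations that fail the check are simply dropped here; a real pipeline might instead log them for later inspection.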
Each variant is annotated within its own page providing further details, which includes the protein sequence, if known, and any related Protein Data Bank (PDB) (http://www.rcsb.org/pdb/) (27) structures, KEGG Pathways, HapMap data and Entrez Gene information. We describe important aspects of our annotation pipeline below.
Protein structural mapping for each amino acid substitution is performed by aligning the query sequence to each high-scoring segment pair (HSP) from BLAST (28) search results using BioPerl scripts (29). BLAST results used for alignment come only from PDB (using a sequence data file downloaded in January 2007) and are limited to those with 100% identity to the original sequence. These pairwise alignments are then used to map the wild-type and mutant residues onto the structure sequence. The annotated mutations that are mapped to a structure can be displayed using the integrated Jmol visualization tool (http://jmol.sourceforge.net/) or in extensions developed for UCSF Chimera (30) and DeLano Scientific PyMOL (http://pymol.sourceforge.net/). The extensions can be downloaded from http://lifescienceweb.org/.
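The position mapping implied by a pairwise alignment can be sketched as follows. This is a minimal Python illustration, not the BioPerl code used by MutDB: it assumes the two aligned strings of one HSP are available with `-` as the gap character, together with the 1-based start coordinates of the HSP in each sequence. The function name and the toy alignment are hypothetical. The subject positions returned are positions in the PDB sequence record, which need not coincide with the residue numbering used in the coordinate file.

```python
from typing import Dict


def map_query_to_structure(aligned_query: str,
                           aligned_subject: str,
                           query_start: int = 1,
                           subject_start: int = 1) -> Dict[int, int]:
    """Map 1-based query positions to 1-based subject positions.

    Only alignment columns in which both sequences have a residue
    (no gap on either side) produce an entry in the mapping.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")

    mapping: Dict[int, int] = {}
    q_pos, s_pos = query_start, subject_start
    for q_char, s_char in zip(aligned_query, aligned_subject):
        q_has_residue = q_char != "-"
        s_has_residue = s_char != "-"
        if q_has_residue and s_has_residue:
            mapping[q_pos] = s_pos
        # Advance each coordinate only when that sequence consumed a residue.
        if q_has_residue:
            q_pos += 1
        if s_has_residue:
            s_pos += 1
    return mapping


if __name__ == "__main__":
    # Toy HSP: the query segment starts at residue 10, the structure at residue 1.
    mapping = map_query_to_structure("MKTAY", "MKTAY", query_start=10, subject_start=1)
    print(mapping.get(12))  # structure position for a substitution at query residue 12
```

With the 100% identity restriction described above, gap columns are not expected, so the mapping reduces to a constant offset between the two start coordinates; the gap handling is kept only so the sketch remains valid for general alignments.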
We provide links to other tools that provide predictions of functional or molecular disruptions caused by an amino acid substitution. These include SNPs3D (5), PolyPhen (4), SIFT (3), PolyDoms (9), PMUT (7) and PANTHER (6) and are deep linked directly to the gene or SNP page, if available. Sorting Intolerant from Tolerant (SIFT) scores (3) and their associated predictions are supplied for each variant causing an amino acid substitution. Variants with low confidence scores are marked with an asterisk. Here, again, the source Swiss-Prot and dbSNP pages are linked.
We have augmented MutDB annotations with KEGG pathways using KEGG web services (23). This enables visualization of proteins and mutations on approximately 188 human pathways found in KEGG. A new link, ‘Visualize Pathways’, on the MutDB gene page takes the user to a page listing the names of all KEGG pathways involving the gene. When a pathway is chosen, the user is taken to a new page displaying the pathway and a list of involved genes and their associated phenotypes.
All genes containing a SNP denoted as having a disease annotation or comment (per Swiss-Prot) are colored yellow in the pathway. This page is also hyper-linked to KEGG and MutDB. This functionality makes use of KEGG SOAP-based web services with supplementary data saved locally (Figure 2).
A recent addition to our toolset is a SNP query tool that enables querying and exporting sets of SNPs that share selected features. The SNP query tool requires two sequence-tagged site (STS) markers or dbSNP reference cluster IDs (rs#) as input and returns all SNPs between the markers. The tool uses AJAX and a paging scheme to remain responsive on large data sets. AJAX improves speed by exchanging small amounts of data with the server, so the entire web page need not be reloaded each time the user makes a change. This technique, together with the broad filtering options, makes the tool interactive.
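The query-and-page behavior described above can be illustrated with a short Python sketch. It assumes that both markers have already been resolved to coordinates on the same chromosome and that SNPs are held in memory; the `Snp` type, the function names, the page size and the rs identifiers are hypothetical and do not reflect the actual MutDB implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Snp:
    """A hypothetical minimal SNP record."""
    rs_id: str
    chrom: str
    position: int  # chromosomal coordinate


def snps_between(snps: Iterable[Snp], chrom: str,
                 marker_a: int, marker_b: int) -> List[Snp]:
    """Return SNPs on `chrom` lying between two marker coordinates (inclusive).

    The markers may be given in either order; results are sorted by position.
    """
    low, high = sorted((marker_a, marker_b))
    hits = [s for s in snps if s.chrom == chrom and low <= s.position <= high]
    return sorted(hits, key=lambda s: s.position)


def get_page(items: List[Snp], page_number: int, page_size: int = 50) -> List[Snp]:
    """Return one page of results; pages are numbered from 1."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    start = (page_number - 1) * page_size
    return items[start:start + page_size]


if __name__ == "__main__":
    catalog = [
        Snp("rs1", "chr1", 100),
        Snp("rs2", "chr1", 250),
        Snp("rs3", "chr2", 150),
        Snp("rs4", "chr1", 175),
    ]
    hits = snps_between(catalog, "chr1", 300, 150)
    for snp in get_page(hits, page_number=1, page_size=2):
        print(snp.rs_id, snp.position)
```

Returning one page per request keeps each server response small, which is the same motivation given above for combining AJAX with paging.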
Users may filter SNPs by manual selection or by one of the filtering criteria. There are currently eleven filter options: validation status in dbSNP; HapMap status; location (functional class); avHet (average heterozygosity in dbSNP); avHetSE (standard error of the average heterozygosity in dbSNP); frequencies in each of four HapMap populations, namely CEU (CEPH, Utah residents with ancestry from northern and western Europe), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan) and YRI (Yoruba in Ibadan, Nigeria); SIFT score (3); and conservation score [based on the UCSC Genome Annotation Database conservation (25)]. The conservation score is the average of the conservation values in a 10-mer window around each SNP, derived from alignments of the 16 vertebrate species in the UCSC Genome Annotation Database.
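A windowed conservation score of this kind can be sketched in Python as follows. The text does not specify how the 10-mer window is placed relative to the SNP, so the placement used here (five positions upstream, the SNP itself and four positions downstream, clipped at the sequence ends) is an assumption, as are the function name and the handling of missing values.

```python
from typing import Optional, Sequence


def window_conservation(scores: Sequence[Optional[float]],
                        snp_index: int,
                        window: int = 10) -> Optional[float]:
    """Average the per-base conservation values in a window around a SNP.

    `scores` holds one conservation value per base (None where no value
    is available) and `snp_index` is the 0-based index of the SNP.
    Assumed placement: window // 2 bases upstream of the SNP, the SNP
    itself, and the remaining bases downstream, clipped at the ends.
    """
    if not 0 <= snp_index < len(scores):
        raise IndexError("snp_index is outside the scored region")
    half = window // 2
    start = max(0, snp_index - half)
    end = min(len(scores), snp_index + half)  # exclusive upper bound
    values = [v for v in scores[start:end] if v is not None]
    if not values:
        return None  # no conservation data available in the window
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy track: a conserved block of ten bases flanked by unconserved bases.
    track = [0.0] * 5 + [1.0] * 10 + [0.0] * 5
    print(window_conservation(track, snp_index=10))
```

Averaging over a window rather than using the single-base value makes the score less sensitive to isolated alignment columns, at the cost of blurring sharp boundaries between conserved and unconserved regions.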
A user can authenticate to enter the tool or visit as a guest, and may save each session and return later. The sequence surrounding a SNP can be retrieved, and SNP data can be exported to Microsoft Excel, via the links provided. Excel output includes the dbSNP rsID, primer sequences and the polymorphic alleles. The tool displays a PNG image containing RefSeq transcript information and location information for all selected SNPs, indexed by function type using the UCSC Genome Annotation Database. A user may also visualize linkage disequilibrium for up to 200 selected SNPs in a Haploview-like (24) display. The SNP query tool is located at http://www.mutdb.org/snp and is linked from each page (Figure 1).
MutDB continues to support its SOAP-based web services. The web services can be accessed via http://www.lifescienceweb.org. This interface is used to communicate with the structural visualization extensions for UCSF Chimera and DeLano Scientific PyMOL.
In MutDB, the most accessed genes may give insight into the current interests of researchers. The most accessed genes from October 2005 to January 2007 are listed in Table 1. Not surprisingly, the most accessed genes also have many mutations associated with them and are what we would consider to be well-studied disease-associated genes.
Understanding the underlying molecular causes of disease remains an important area for research. We continue to investigate annotations that are useful for hypothesis generation and directing experimental validation. While we continue to update the database as new annotations become available, we are also adding useful annotations outside of protein amino acid changes such as noncoding, synonymous and intronic variation.
We would like to thank Shoji Ichikawa and Somying Promso for helpful comments on the SNP query tool. We are supported by NLM K22LM009135 (PI: Mooney), P01AG018397 (PI: Econs), a grant from IU Biomedical Research Council, an RSFG grant from IUPUI, the Showalter Trust and the Indiana Genomics Initiative. The Indiana Genomics Initiative (INGEN) is supported in part by the Lilly Endowment. RH is supported by Indiana Pervasive Computing Research (IPCRES) Initiative. Funding to pay the Open Access publication charges for this article was provided by NLM K22LM009135 (PI: Mooney).
Conflict of interest statement. None declared.