Improved query capabilities
The current SNPnexus web server accepts data submitted in three different forms: genomic position, chromosomal region or dbSNP identifier. Queries can be made in both single and batch mode. As with the previous version, users can annotate a novel variation by providing physical coordinates on the genomic clone (clone, contig or chromosome), its reference and observed alleles, and strand information.
While annotations for known SNPs can still be done by providing the dbSNP rs# number, an additional query feature has been added to annotate all known variants in a given genomic region by simply providing its start and end position on the chromosome. This can be useful when dealing with the reassessment of the functional role of HapMap and dbSNP variants within a given region, for example.
Alongside single base substitutions, SNPnexus has been upgraded to support multiple nucleotide substitutions, insertions and deletions (InDels), covering a wider range of variation data. Users can also use the International Union of Pure and Applied Chemistry (IUPAC) code to denote ambiguous sites in a given DNA sequence motif, where a single character may represent more than one nucleotide. Allowing IUPAC code makes SNPnexus easy to use and complementary to the main genotype-calling algorithms widely used in sequencing projects.
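As an illustration of how IUPAC ambiguity codes map to nucleotides, the following minimal Python sketch expands a single IUPAC character into the bases it represents. The mapping itself is the standard IUPAC nomenclature; the function names `expand_alleles` and `observed_alleles` are hypothetical and are not taken from SNPnexus, whose internal handling of these codes is not described here.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_alleles(code: str) -> list[str]:
    """Return the nucleotides represented by one IUPAC character."""
    return sorted(IUPAC[code.upper()])


def observed_alleles(reference: str, code: str) -> list[str]:
    """Hypothetical helper: alleles implied by an ambiguity code,
    excluding the reference base."""
    return [base for base in expand_alleles(code) if base != reference.upper()]


if __name__ == "__main__":
    print(expand_alleles("R"))         # purine: A or G
    print(observed_alleles("A", "R"))  # alternative allele given reference A
```

Under this reading, a genotype call reported as R at a site whose reference base is A would be interpreted as an A-to-G substitution.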
SNPnexus accepts text-based batch query files, where each line corresponds to one genetic variant, either known or novel, including insertions, deletions or single block substitutions. It uses parallel processing to speed up the calculation for larger queries of up to 100 000 variants. In addition to the NCBI36/hg18 human genome assembly, SNPnexus now also supports the latest GRCh37/hg19 release.
With increasing interest in next-generation sequencing to rapidly discover novel variants and sub-select functionally relevant ones, allowing larger queries with improved capabilities through a web-based tool is timely for the research community.
Improved annotation categories
Enriched gene/protein consequences
In addition to the previously supported five major gene reference systems including NCBI RefSeq, Ensembl, Vega, UCSC and AceView, SNPnexus now enables users to compute functional consequences on H-Invitational (16) and CCDS (17) genes, thus providing the most extensive information on alternative splicing. For protein-coding transcripts, the predicted functional effect of a variant falls into one of the following categories: coding, splice site, 5′-UTR, 3′-UTR, upstream, downstream, intronic. In addition, SNPnexus now computes whether a variant falls into an exonic, intronic or splice-site region of a non-coding transcript.
For an intronic variant, the distance to the splice site is reported. For a coding variant, SNPnexus reports the related base pair position within the cDNA and CDS, the corresponding amino acid position in the peptide chain, the subsequent amino acid change and the reference/altered protein sequences. On top of showing whether a coding variant is synonymous or non-synonymous, we report whether non-synonymous substitutions result in immediate stop-codon gain or loss. When an InDel or block substitution occurs within a coding region, it is reported as a peptide-shift or frame-shift.
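The coding-consequence categories described above can be illustrated with a minimal Python sketch based on the standard genetic code. This is not the SNPnexus implementation: the function names are hypothetical, and the rule used for InDels and block substitutions (frame-shift when the net length change is not a multiple of three, peptide-shift otherwise) is our assumption about how the two labels are distinguished.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change in a coding sequence."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-synonymous (stop-gain)"
    if ref_aa == "*":
        return "non-synonymous (stop-loss)"
    return "non-synonymous"


def classify_length_change(ref_allele: str, alt_allele: str) -> str:
    """Assumed rule: a net length change that is not a multiple of three
    disrupts the reading frame. An empty string denotes a missing allele."""
    delta = len(alt_allele) - len(ref_allele)
    return "frame-shift" if delta % 3 else "peptide-shift"


if __name__ == "__main__":
    print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
    print(classify_substitution("TGG", "TGA"))  # Trp -> stop
    print(classify_length_change("ACG", "A"))   # 2-bp deletion
```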
For coding non-synonymous variations, SNPnexus provides the predicted deleterious effect on protein function based on SIFT predictions (18). Predictions are only shown for complete Ensembl proteins. No predictions are shown for non-synonymous substitutions resulting in stop-gain or stop-loss as these fundamentally change the protein sequence.
Updated HapMap population data
On top of the four HapMap populations available in the previous release of SNPnexus for the hg18 assembly [African YRI (from Yoruba in Ibadan, Nigeria), Japanese JPT (from Tokyo, Japan), Han Chinese CHB (from Beijing, China) and European ancestry CEU (from Utah, USA)], the current version incorporates seven additional populations on the hg19 assembly: African Ancestry ASW (from SouthWestern USA), Chinese Ancestry CHD (from Metropolitan Denver, USA), Gujarati Indians GIH (from Houston, USA), Luhya LWK (from Webuye, Kenya), Mexican Ancestry MEX (from Los Angeles, USA), Masai MKK (from Kinyawa, Kenya) and Toscani TSI (from Italy). SNPnexus provides both genotype and allele information, with the related counts and frequencies.
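To make the relationship between genotype and allele counts and frequencies concrete, the following Python sketch derives all four quantities from a list of diploid genotypes for one population. The input format (two-character genotype strings) and the function name are assumptions made for illustration; they do not reflect the HapMap file layout or the SNPnexus code.

```python
from collections import Counter


def genotype_and_allele_stats(genotypes):
    """Summarise diploid genotypes given as two-character strings, e.g. "AG".

    Heterozygotes are normalised so that "AG" and "GA" count as one genotype.
    """
    genotype_counts = Counter("".join(sorted(g.upper())) for g in genotypes)

    # Each genotype contributes two alleles to the allele counts.
    allele_counts = Counter()
    for genotype, n in genotype_counts.items():
        for allele in genotype:
            allele_counts[allele] += n

    n_genotypes = sum(genotype_counts.values())
    n_alleles = sum(allele_counts.values())
    return {
        "genotype_counts": dict(genotype_counts),
        "genotype_freqs": {g: c / n_genotypes for g, c in genotype_counts.items()},
        "allele_counts": dict(allele_counts),
        "allele_freqs": {a: c / n_alleles for a, c in allele_counts.items()},
    }


if __name__ == "__main__":
    # Made-up genotypes for four individuals, for illustration only.
    stats = genotype_and_allele_stats(["AA", "AA", "AG", "GG"])
    print(stats["allele_counts"])
    print(stats["allele_freqs"])
```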
Updated regulatory data
In addition to annotations for conserved transcription factor binding sites, miRNAs, putative miRNA target sites and predicted 5′-terminal exons/promoters, SNPnexus adds information on two types of regulatory elements: CpG islands (19) and Vista Enhancers (20). CpG islands are DNA regions with high G+C content that can influence gene expression and modulate processes such as carcinogenesis (21). Vista Enhancers are the predicted non-coding distant-acting transcriptional enhancers in the human genome identified as conserved in human, mouse and rat. With SNPnexus, one can quickly investigate whether a variant overlaps with Vista enhancers or CpG islands, and therefore potentially alters the transcriptional and post-transcriptional regulation of gene expression.
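Checking whether a variant overlaps a regulatory element reduces to an interval-overlap test on genomic coordinates. The sketch below shows one simple way to do this in Python, assuming 1-based inclusive coordinates; the `Feature` type, the function name and the example coordinates are hypothetical, and a linear scan is used for clarity rather than the indexed lookups a production system would need.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive
    name: str


def overlapping_features(chrom, start, end, features):
    """Return the features that share at least one base with chrom:start-end."""
    return [
        f for f in features
        if f.chrom == chrom and f.start <= end and start <= f.end
    ]


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    regulatory = [
        Feature("chr1", 1000, 1500, "CpG island (example)"),
        Feature("chr1", 5000, 6200, "Vista enhancer (example)"),
    ]
    # A single-base variant at chr1:1200 and a 3-bp deletion at chr1:4998-5000.
    print(overlapping_features("chr1", 1200, 1200, regulatory))
    print(overlapping_features("chr1", 4998, 5000, regulatory))
```

Representing a single-base variant as an interval with identical start and end lets the same test cover substitutions, InDels and block substitutions.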
Conservation scoring is a new addition to SNPnexus. It shows the estimated probability that a particular variant belongs to a conserved genomic region, based on the multiple alignments of 44/46 vertebrate species using the phastCons method from the PHAST package (22). Focusing on variants that fall in highly conserved genomic regions helps to prioritize important candidate variants for analysis in disease studies.
Enriched phenotype and disease association
SNPnexus allows users to connect to a rich collection of genetic association studies from three sources: the Genetic Association Database (GAD) (13), an archive of human genetic association studies of complex diseases and disorders; the Catalogue Of Somatic Mutations In Cancer (COSMIC) (23), an online database for somatic mutation information related to human cancers; and the NHGRI genome-wide association study (GWAS) catalogue (24), a resource for mining published SNP-trait/disease associations. When investigating the role of variants, users can mine these databases and extract any information related to the gene(s)/variant(s) of interest.
Restructured and enriched structural variation data
SNPnexus has been restructured to locate variants in four types of structural variation regions from the Database of Genomic Variants (25): copy number variations (CNVs), insertions/deletions (InDels), inversions and inversion breakpoints. The new version accommodates updated data from a large collection of peer-reviewed research studies.
Data processing from updated sources
SNPnexus is not merely a collection of annotated data sets; rather, it uses primary annotation data sets from different sources to calculate functional annotations on the fly. Primary data sets for most of the annotation categories are collected from UCSC. Currently SNPnexus maintains two separate databases for GRCh37/hg19 and NCBI36/hg18. The UCSC annotation data sets are built from MySQL tables available from the UCSC human genome annotation database (http://hgdownload.cse.ucsc.edu/downloads.html#human). Reference human genome sequence and public domain SNP details are collected from BioMart release 63 for hg19 (ftp://ftp.ensembl.org/pub/release-63/mysql/) and release 54 for hg18 (ftp://ftp.ensembl.org/pub/release-54/mysql/). Pre-computed SIFT predictions for non-synonymous amino acid substitutions in Ensembl proteins are available from the SIFT Human Protein DB release 63 (ftp://ftp.jcvi.org/pub/data/sift/Human_db_37_ensembl_63/). The Genetic Association Database team provides a link to download the GAD data file. COSMIC and miRBase data are downloaded from their corresponding FTP sites (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic). SNPnexus does not provide annotations for H-Invitational gene/protein consequences and SIFT predictions on hg18, or for predicted 5′-terminal exons/promoters on hg19, because the data are unavailable in the corresponding primary data sources.
The ability to connect users' submitted queries to these data sources and compute a wide range of functional annotations on the fly makes SNPnexus a unique, timely and valuable tool.
Improved presentation of the results
The basic output remains the same, with the main focus on showing as much detail as possible in the web page and providing links to the related web data sources, if available. For each selected output annotation category, results are shown in separate tables and are available for download as tab-delimited text files. The new version also allows all results to be downloaded as an Excel file composed of separate worksheets representing the selected output annotations. From our experience with gene/protein consequences analysis, users are often interested in obtaining not only the specific amino acid changes but also the altered protein sequences for further investigation. The downloadable text and Excel files therefore contain the reference and altered (if non-synonymous) protein sequences for coding variants (not available in the web page). For the gene consequences category, we provide an additional graphical representation of the distribution of predicted functional consequences. This is particularly useful for variation analysis within a genomic region, where users can assess the relative functional importance of the region.
The user notification system has been improved as well. Users are no longer required to provide their email address with submitted queries. After submission, users are immediately notified of the current status of the query in the result page and can visit the page at any time to check the query status. The results are accessible via the same page once the analysis is completed. Due to the huge amount of data processed every day, the results are kept and made available for 72 hours. If a user provides a valid email address, notifications of acceptance and completion are sent via email.