Whole-genome resequencing experiments are performed to systematically identify genomic variations. For example, the complete genome of a single Bos taurus animal was sequenced to identify millions of previously unknown cattle SNPs [28]. In another study, artificial mutations responsible for phenotypes in Caenorhabditis elegans were identified by whole-genome sequencing [38]. Distilling this huge quantity of information into meaningful lists of SNPs is a multi-step bioinformatics process.
NovelSNPer is an easy-to-use tool that helps scientists analyse next-generation sequencing data. Lengthy lists of SNPs from next-generation resequencing projects are efficiently assessed and annotated with the most important SNP features. Of utmost interest is the functional class of an SNP. SNPs that introduce a stop codon (nonsense mutations) should in most cases severely impair a protein's function. Non-synonymous SNPs can also modify a protein's conformation, depending on how dissimilar the exchanged amino acids are. Whereas these two classes of SNPs can have a more or less direct effect on a protein (see, e.g., [39]), SNPs in untranslated regions (UTRs), introns, and up- and downstream regions have the potential to alter the binding behaviour of transcription factors or splice factors and thus to alter gene expression indirectly.
Finally, variations in the immediate neighbourhood of exon-intron boundaries could influence the splicing of a transcript. One important type of genomic variation is the frameshift, that is, a disruption of the reading frame. A frameshift is caused by an indel in the coding part of a gene whose length, whether a single nucleotide or a longer stretch, is not a multiple of three.
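The functional classes discussed above can be illustrated with a minimal Python sketch. It is not NovelSNPer's implementation: the function names and the lowercase class labels are hypothetical, and the sketch assumes the standard genetic code, codons given on the coding strand, and indel alleles given as plain nucleotide strings.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Return an illustrative functional class for a codon substitution."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"      # nonsense mutation
    if ref_aa == "*":
        return "stop_lost"
    return "non_synonymous"


def is_frameshift(ref_allele, alt_allele):
    """A coding indel shifts the frame if its net length is not a multiple of three."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
print(is_frameshift("A", "AC"))             # one inserted nucleotide
```

A real annotation additionally has to place the variation within the transcript (strand, exon structure, and codon phase), which this sketch leaves out.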
Two more variation features permit life scientists to assess the severity of the consequences a variation can have: (a) the protein domain [26] that is affected by an indel or a non-synonymous SNP and (b) the difference between observed and expected conservation score [27]. A large difference between observed and expected conservation score suggests that stronger-than-expected negative selection pressure acts to maintain the nucleotide at that position. A variation at such a position is therefore likely to have negative consequences for the phenotype. Protein domains can be considered protein building blocks that occur in various proteins within a species and across species. They are more conserved than those parts of a protein that are not organized in domains. A non-synonymous SNP that exchanges an amino acid within a protein domain should therefore have more severe consequences for the phenotype than a non-synonymous SNP that affects an amino acid outside a protein domain.
NovelSNPer is one of the few tools that explicitly classify variations as POTENTIAL_START_GAINED or START_LOST if a codon is a start codon in either the assembly or the reference sequence but not in the other. We found such a classification only in the Mouse SNP database of the Center for Genome Dynamics at the Jackson Laboratory (http://cgd.jax.org/cgdsnpdb/).
One reason why other tools and databases are apparently reluctant to use these two classes might be that a start codon alone is not sufficient to predict the translation start with certainty. A variation classified as POTENTIAL_START_GAINED is a potential new start codon that competes with the previously existing start codon (if the latter has remained unaltered). In contrast, a START_LOST variation can be considered an unambiguous signal that the translation start is shifted compared to the wild type. Alternative translation start sites were discussed, for example, in [30, 31]. The figures that we presented in Section 3 suggest that the protein synthesis of a substantial number of genes can be altered by POTENTIAL_START_GAINED or START_LOST variations. One percent of the genes appear to be susceptible to complete suppression of protein synthesis.
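The distinction between the two start-codon classes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual logic: the function name and the `at_annotated_start` parameter are hypothetical, only ATG is treated as a start codon, and START_LOST is reported only when the affected codon is the annotated translation start.

```python
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def classify_start_change(ref_codon, alt_codon, at_annotated_start=False):
    """Return POTENTIAL_START_GAINED, START_LOST, or None for a codon change.

    ref_codon is the codon in the reference sequence, alt_codon the codon in
    the resequenced assembly; both are read on the coding strand.
    """
    ref_is_start = ref_codon.upper() == START_CODON
    alt_is_start = alt_codon.upper() == START_CODON
    if at_annotated_start and ref_is_start and not alt_is_start:
        # The existing translation start is destroyed.
        return "START_LOST"
    if not ref_is_start and alt_is_start:
        # A new ATG appears; it only *potentially* initiates translation.
        return "POTENTIAL_START_GAINED"
    return None


print(classify_start_change("ATG", "ACG", at_annotated_start=True))
print(classify_start_change("ACG", "ATG"))
```

The label POTENTIAL_START_GAINED reflects the caveat made above: a new ATG does not by itself prove that translation starts there.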
Because NovelSNPer queries the Ensembl database online, it does not need to be maintained regularly. A further strength is that resequencing data from every species whose genome sequence is available in Ensembl can be processed. When NovelSNPer is used offline, variations can be annotated for any species as long as the genomic structure of the transcripts is available as a BED or GFF file [19, 20].
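As a rough illustration of what offline annotation from a BED file involves, the following Python sketch reads the three mandatory BED columns and tests whether a variation lies inside a feature. The function names are hypothetical and this is not NovelSNPer's parser; the only firm assumption is the BED coordinate convention (zero-based start, end not included), with variation positions given as one-based coordinates.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


def position_in_interval(pos_1based, start, end):
    """True if a one-based position lies inside a BED interval.

    BED intervals are zero-based and half-open, so the interval covers the
    one-based positions start + 1 through end.
    """
    return start < pos_1based <= end


chrom, start, end = parse_bed_line("chr1\t100\t200\ttranscript_1")
print(chrom, position_in_interval(101, start, end))
```

GFF files, in contrast, use one-based, fully closed coordinates, so the two formats need separate handling of interval boundaries.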
NovelSNPer complements the functionality of existing tools in many ways: variation calls are checked against the reference sequence to guarantee the correctness of the results, relative SNP positions can be used, the novel functional classes POTENTIAL_START_GAINED and START_LOST are introduced, heterozygous variation calls can be handled, there is no limit on the number of variations that can be processed, and information about the degree of conservation and about protein domains is presented. SNPs and indels can be processed alike.
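The check of variation calls against the reference sequence can be pictured as follows. This Python sketch is an assumption about what such a check might look like, not a description of the tool's code: the function name is hypothetical, the reference is held as a plain string, and positions are one-based.

```python
def call_matches_reference(reference, pos_1based, ref_allele):
    """True if the reported reference allele equals the reference sequence
    at the given one-based position."""
    start = pos_1based - 1
    end = start + len(ref_allele)
    if start < 0 or end > len(reference):
        return False  # the call points outside the reference sequence
    return reference[start:end].upper() == ref_allele.upper()


reference = "ACGTACGT"
print(call_matches_reference(reference, 3, "G"))  # consistent call
print(call_matches_reference(reference, 3, "T"))  # inconsistent call
```

A mismatch typically indicates a coordinate-system or strand problem in the input list, so rejecting such calls early avoids misleading annotations downstream.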