Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common type of polymorphisms and are frequently used for molecular marker development. Such markers have become very popular for all kinds of genetic analysis, including haplotype reconstruction. Haplotypes can be reconstructed for whole chromosomes but also for specific genes, based on the SNPs present. Haplotypes in the latter context represent the different alleles of a gene. The computational approach to SNP mining is becoming increasingly popular because of the continuously increasing number of sequences deposited in databases, which allows a more accurate identification of SNPs. Several software packages have been developed for SNP mining from databases. From these, QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles, which results in a lower number of false positive SNPs and also works much faster than other programs. We have build a web-based SNP discovery and allele detection tool (HaploSNPer) based on QualitySNP.
HaploSNPer is a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy.
Thorough testing of HaploSNPer (and the underlying QualitySNP algorithm) has shown that EST information alone is sufficient for the identification of alleles and that reliable SNPs can be found efficiently. Furthermore, HaploSNPer supplies a user friendly interface for visualization of SNP and alleles. HaploSNPer is available from .
There is recently great interest in haplotype block structure and haplotype tagging SNPs (htSNPs) in the human genome for its implication on htSNPs-based association mapping strategy for complex disease. Different definitions have been used to characterize the haplotype block structure in the human genome, and several different performance criteria and algorithms have been suggested on htSNPs selection.
A heuristic algorithm, generalized branch-and-bound algorithm, is applied to the searching of minimal set of haplotype tagging SNPs (htSNPs) according to different htSNPs performance criteria. We develop a software htSNPer1.0 to implement the algorithm, and integrate three htSNPs performance criteria and four haplotype block definitions for haplotype block partitioning. It is a software with powerful Graphical User Interface (GUI), which can be used to characterize the haplotype block structure and select htSNPs in the candidate gene or interested genomic regions. It can find the global optimization with only a fraction of the computing time consumed by exhaustive searching algorithm.
htSNPer1.0 allows molecular geneticists to perform haplotype block analysis and htSNPs selection using different definitions and performance criteria. The software is a powerful tool for those focusing on association mapping based on strategy of haplotype block and htSNPs.
Ancestral sequence reconstruction is essential to a variety of evolutionary studies. Here, we present the FastML web server, a user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiples sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences. The utility of FastML is demonstrated by reconstructing ancestral sequences of the Env protein from various HIV-1 subtypes. FastML is freely available for all academic users and is available online at http://fastml.tau.ac.il/.
This paper describes a new program SnpSift for filtering differential DNA sequence variants between two or more experimental genomes after genotoxic chemical exposure. Here, we illustrate how SnpSift can be used to identify candidate phenotype-relevant variants including single nucleotide polymorphisms, multiple nucleotide polymorphisms, insertions, and deletions (InDels) in mutant strains isolated from genome-wide chemical mutagenesis of Drosophila melanogaster. First, the genomes of two independently isolated mutant fly strains that are allelic for a novel recessive male-sterile locus generated by genotoxic chemical exposure were sequenced using the Illumina next-generation DNA sequencer to obtain 20- to 29-fold coverage of the euchromatic sequences. The sequencing reads were processed and variants were called using standard bioinformatic tools. Next, SnpEff was used to annotate all sequence variants and their potential mutational effects on associated genes. Then, SnpSift was used to filter and select differential variants that potentially disrupt a common gene in the two allelic mutant strains. The potential causative DNA lesions were partially validated by capillary sequencing of polymerase chain reaction-amplified DNA in the genetic interval as defined by meiotic mapping and deletions that remove defined regions of the chromosome. Of the five candidate genes located in the genetic interval, the Pka-like gene CG12069 was found to carry a separate pre-mature stop codon mutation in each of the two allelic mutants whereas the other four candidate genes within the interval have wild-type sequences. The Pka-like gene is therefore a strong candidate gene for the male-sterile locus. These results demonstrate that combining SnpEff and SnpSift can expedite the identification of candidate phenotype-causative mutations in chemically mutagenized Drosophila strains. This technique can also be used to characterize the variety of mutations generated by genotoxic chemicals.
personal genomes; Drosophila melanogaster; whole-genome SNP analysis; next-generation DNA sequencing
Motivation: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.
Results: In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt and is fastest in detecting complex variants with four or more mismatches or insertions of 1–9 nt and deletions of 1–30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7–8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.
Availability: Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.
Insertion and deletion variants (indels) within poly glycine tracts of fused in sarcoma (FUS) were initially reported as causative of disease in amyotrophic lateral sclerosis (ALS). Subsequent studies identified similar indels in controls and suggested that these indels may confer susceptibility to ALS. We aimed to elucidate the role of previously published and novel exonic indels in FUS in an extensive cohort of 630 ALS patients and 1063 controls. We detected indels in FUS exons 5, 6, 12 and 14 with similar frequencies in patients (0.95%) and controls (0.75%). Exonic indels in poly glycine tracts were also observed with similar frequencies. The largest indel (p.Gly138_Tyr143del) was observed in one control. In one patient, a 3 base pair deletion in exon 14 (p.Gly475del) was identified, however in-vitro studies did not reveal abnormal localization of p.Gly475del mutant FUS. These findings suggest that not all exonic indels in FUS cause disease.
Summary: AnnTools is a versatile bioinformatics application designed for comprehensive annotation of a full spectrum of human genome variation: novel and known single-nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variants/copy number variation (SV/CNV). The variants are interpreted by interrogating data compiled from 15 constantly updated sources. In addition to detailed functional characterization of the coding variants, AnnTools searches for overlaps with regulatory elements, disease/trait associated loci, known segmental duplications and artifact prone regions, thereby offering an integrated and comprehensive analysis of genomic data. The tool conveniently accepts user-provided tracks for custom annotation and offers flexibility in input data formats. The output is generated in the universal Variant Call Format. High annotation speed makes AnnTools suitable for high-throughput sequencing facilities, while a low-memory footprint and modest CPU requirements allow it to operate on a personal computer. The application is freely available for public use; the package includes installation scripts and a set of helper tools.
Supplementary data are available at Bioinformatics online.
Motivation: High-throughput sequencing technologies have made population-scale studies of human genetic variation possible. Accurate and comprehensive detection of DNA sequence variants is crucial for the success of these studies. Small insertions and deletions represent the second most frequent class of variation in the human genome after single nucleotide polymorphisms (SNPs). Although several alignment tools for the gapped alignment of sequence reads to a reference genome are available, computational methods for discriminating indels from sequencing errors and genotyping indels directly from sequence reads are needed.
Results: We describe a probabilistic method for the accurate detection and genotyping of short indels from population-scale sequence data. In this approach, aligned sequence reads from a population of individuals are used to automatically account for context-specific sequencing errors associated with indels. We applied this approach to population sequence datasets from the 1000 Genomes exon pilot project generated using the Roche 454 and Illumina sequencing platforms, and were able to detect a significantly greater number of indels than reported previously. Comparison to indels identified in the 1000 Genomes pilot project demonstrated the sensitivity of our method. The consistency in the number of indels and the fraction of indels whose length is a multiple of three across different human populations and two different sequencing platforms indicated that our method has a low false discovery rate. Finally, the method represents a general approach for the detection and genotyping of small-scale DNA sequence variants for population-scale sequencing projects.
Availability: A program implementing this method is available at http://polymorphism.scripps.edu/~vbansal/software/piCALL/
Supplementary information: Supplementary data are available at Bioinformatics online.
Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
The resolution of genome-wide association studies (GWAS) is limited by the linkage disequilibrium (LD) structure of the population being studied. Selecting the most likely causal variants within an LD block is relatively straightforward within coding sequence, but is more difficult when all variants are intergenic. Predicting functional non-coding sequence has been recently facilitated by the availability of conservation and epigenomic information. We present HaploReg, a tool for exploring annotations of the non-coding genome among the results of published GWAS or novel sets of variants. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state in nine cell types, conservation across mammals and their effect on regulatory motifs. Sets of SNPs, such as those resulting from GWAS, are analyzed for an enrichment of cell type-specific enhancers. HaploReg will be useful to researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. The HaploReg database is available at http://compbio.mit.edu/HaploReg.
The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51(FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a160 kb region encompassing FKBP5. 613 SNPs and 57 indels, including a 3.3 kb deletion were discovered. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5 hours to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression. Finally we show how our discovery of rare variants may change current conceptions of evolution at this locus.
pattern recognition; next-generation sequencing analysis; indels; rare variants; FKBP5; HLA
The ABO blood groups result from DNA sequence variations, predominantly single nucleotide and insertion/deletion polymorphisms (SNPs and indels), in the ABO gene encoding a glycosyltransferase. The ABO blood groups A1, A2, B and O predominantly result from the wild type allele A1 and the major gene variants that are characterized by four diallelic markers (261G>del, 802G>A, 803G>C, 1061C>del). Here, we were interested to evaluate the impact of ABO genotyping compared to ABO phenotyping in paternity testing.
The major ABO alleles were determined by PCR amplification with sequence-specific primers (PCR-SSP) in a representative sample of 1,335 blood donors. The genotypes were compared to the ABO blood groups registered in the blood donor files. Then, the ABO phenotypes and genotypes were determined in 95 paternity trio cases that have been investigated by 12 short tandem repeat (STR) markers before. We compared statistical parameters (PL, paternity likelihood; PE, power of exclusion) of both blood grouping approaches.
The prevalence of the major ABO alleles and genotypes corresponded to the expected occurrence of ABO blood groups in a Caucasian population. The low resolution genotyping of 4 diallelic markers revealed a correct genotype-phenotype correlation in 1,331 of 1,335 samples (99.7%). In 60 paternity trios with confirmed paternity of the alleged father based on STR analysis both PL and PE of the ABO genotype was significantly higher than of the ABO phenotype. In 12 of 35 exclusion cases (34.3%) the ABO genotype also excluded the alleged father, whereas the ABO phenotype excluded the alleged father only in 7 cases (20%).
In paternity testing ABO genotyping is superior to ABO phenotyping with regard to PL and PE, however, ABO genotyping is not sufficient for valid paternity testing. Due to the much lower mutation rate compared to STR markers, blood group SNPs in addition to anonymous SNPs could be considered for future kinship analysis and genetic identity testing.
ABO genotyping; Paternity likelihood; Power of exclusion; Paternity trios
With ever-increasing numbers of microbial genomes being sequenced, efficient tools are needed to perform strain-level identification of any newly sequenced genome. Here, we present the SNP identification for strain typing (SNIT) pipeline, a fast and accurate software system that compares a newly sequenced bacterial genome with other genomes of the same species to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Based on this information, the pipeline analyzes the polymorphic loci present in all input genomes to identify the genome that has the fewest differences with the newly sequenced genome. Similarly, for each of the other genomes, SNIT identifies the input genome with the fewest differences. Results from five bacterial species show that the SNIT pipeline identifies the correct closest neighbor with 75% to 100% accuracy. The SNIT pipeline is available for download at http://www.bhsai.org/snit.html
Several ways of incorporating indels into phylogenetic analysis have been suggested. Simple indel coding has two strengths: (1) biological realism and (2) efficiency of analysis. In the method, each indel with different start and/or end positions is considered to be a separate character. The presence/absence of these indel characters is then added to the data set.
We have written a program, GapCoder to automate this procedure. The program can input PIR format aligned datasets, find the indels and add the indel-based characters. The output is a NEXUS format file, which includes a table showing what region each indel characters is based on. If regions are excluded from analysis, this table makes it easy to identify the corresponding indel characters for exclusion.
Manual implementation of the simple indel coding method can be very time-consuming, especially in data sets where indels are numerous and/or overlapping. GapCoder automates this method and is therefore particularly useful during procedures where phylogenetic analyses need to be repeated many times, such as when different alignments are being explored or when various taxon or character sets are being explored. GapCoder is currently available for Windows from .
Primary ciliary dyskinesia (PCD) is a genetic disorder, usually autosomal recessive, causing early respiratory disease and later subfertility. Whole exome sequencing may enable efficient analysis for locus heterogeneous disorders such as PCD. We whole exome sequenced one consanguineous Saudi Arabian with clinically diagnosed PCD and normal laterality, to attempt ab initio molecular diagnosis.
We reviewed thirteen known PCD genes and potentially autozygous regions (extended homozygosity) for homozygous exon deletions, non-dbSNP codon, splice-site base variants or small indels. Homozygous non-dbSNP changes were also reviewed exome-wide.
One single molecular read representing RSPH9 p.Lys268del was observed, with no wildtype reads, and a notable deficiency of mapped reads at this location. Among all observations, RSPH9 was the strongest candidate for causality. Searching unmapped reads revealed seven more mutant reads. Direct assay for p.Lys268del (MboII digest) confirmed homozygosity in the affected individual, then confirmed homozygosity in three siblings with bronchiectasis. Our finding in southwest Saudi Arabia indicates that p.Lys268del, previously observed in two Bedouin families (Israel, UAE) is geographically widespread in the Arabian Peninsula. Analogous with cystic fibrosis CFTR p.Phe508del, screening for RSPH9 p.Lys268del (which lacks sentinel dextrocardia) in those at risk would help in early diagnosis, tailored clinical management, genetic counselling and primary prevention.
high-throughput nucleotide sequencing; primary ciliary dyskinesia; screening
A high-throughput method for genotyping by mapping InDels. This method has been used to create fragment-length polymorphism maps for Drosophila and C. elegans.
Small insertions or deletions (InDels) constitute a ubiquituous class of sequence polymorphisms found in eukaryotic genomes. Here, we present an automated high-throughput genotyping method that relies on the detection of fragment-length polymorphisms (FLPs) caused by InDels. The protocol utilizes standard sequencers and genotyping software. We have established genome-wide FLP maps for both Caenorhabditis elegans and Drosophila melanogaster that facilitate genetic mapping with a minimum of manual input and at comparatively low cost.
Indels are an important cause of human variation and central to the study of human disease. The 1000 Genomes Project Low-Coverage Pilot identified over 1.3 million indels shorter than 50 bp, of which over 890 were identified as potentially disruptive variants. Yet, despite their ubiquity, the local genomic characteristics of indels remain unexplored.
Herein we describe population- and minor allele frequency-based differences in linkage disequilibrium and imputation characteristics for indels included in the 1000 Genomes Project Low-Coverage Pilot for the CEU, YRI and CHB+JPT populations. Common indels were well tagged by nearby SNPs in all studied populations, and were also tagged at a similar rate to common SNPs. Both neutral and functionally deleterious common indels were imputed with greater than 95% concordance from HapMap Phase 3 and OMNI SNP sites. Further, 38 to 56% of low frequency indels were tagged by low frequency SNPs. We were able to impute heterozygous low frequency indels with over 50% concordance. Lastly, our analysis also revealed evidence of ascertainment bias. This bias prevents us from extending the applicability of our results to highly polymorphic indels that could not be identified in the Low-Coverage Pilot.
Although further scope exists to improve the imputation of low frequency indels, our study demonstrates that there are already ample opportunities to retrospectively impute indels for prior genome-wide association studies and to incorporate indel imputation into future case/control studies.
With the influx of various SNP genotyping assays in recent years, there has been a need for an assay that is robust, yet cost effective, and could be performed using standard gel-based procedures. In this context, CAPS markers have been shown to meet these criteria. However, converting SNPs to CAPS markers can be a difficult process if done manually. In order to address this problem, we describe a computer program, SNP2CAPS, that facilitates the computational conversion of SNP markers into CAPS markers. 413 multiple aligned sequences derived from barley ESTs were analysed for the presence of polymorphisms in 235 distinct restriction sites. 282 (90%) of 314 alignments that contain sequence variation due to SNPs and InDels revealed at least one polymorphic restriction site. After reducing the number of restriction enzymes from 235 to 10, 31% of the polymorphic sites could still be detected. In order to demonstrate the usefulness of this tool for marker development, we experimentally validated some of the results predicted by SNP2CAPS.
Mining genetic variation from personal genomes is a crucial step towards investigating the relationship between genotype and phenotype. However, compared to the detection of SNPs and small indels, characterizing large and particularly complex structural variation is much more difficult and less intuitive. In this article, we present a new scheme (inGAP-sv) to detect and visualize structural variation from paired-end mapping data. Under this scheme, abnormally mapped read pairs are clustered based on the location of a gap signature. Several important features, including local depth of coverage, mapping quality and associated tandem repeat, are used to evaluate the quality of predicted structural variation. Compared with other approaches, it can detect many more large insertions and complex variants with lower false discovery rate. Moreover, inGAP-sv, written in Java programming language, provides a user-friendly interface and can be performed in multiple operating systems. It can be freely accessed at http://ingap.sourceforge.net/.
Summary: A tool to predict the effect that newly discovered genomic variants have on known transcripts is indispensible in prioritizing and categorizing such variants. In Ensembl, a web-based tool (the SNP Effect Predictor) and API interface can now functionally annotate variants in all Ensembl and Ensembl Genomes supported species.
Availability: The Ensembl SNP Effect Predictor can be accessed via the Ensembl website at http://www.ensembl.org/. The Ensembl API (http://www.ensembl.org/info/docs/api/api_installation.html for installation instructions) is open source software.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Allele and haplotype segregation at the polymorphic sites within the promoter (23indel) and intron 1 (12indel) regions of the PRNP gene was analyzed in Polish Holstein–Friesian cattle. More 23del/del homozygotes and fewer 23ins/ins homozygotes than expected were observed in the offspring of ♂ 23ins/del × ♀ 23ins/del parents. In the offspring of ♂ 23ins/del × ♀ 23del/del parents and ♂ 23del/del × ♀ 23ins/del parents, a trend toward more 23del/del animals and fewer 23ins/del animals than expected was noted. At the 12indel polymorphic site, the only trend found was one toward fewer 12ins/ins genotypes and more 12ins/del and 12del/del genotypes than expected in the offspring of ♂ 12ins/del × ♀ 12ins/del parents. An analysis of haplotype segregation revealed more 23del-12del/23del-12del diplotypes and fewer 23ins-12ins/23ins-12ins diplotypes at the significance threshold than expected in the offspring of ♂ 23ins-12ins/23del-12del × ♀ 23ins-12ins/23del-12del parents.
Allele and haplotype segregation; Indel polymorphisms; Holstein–Friesian cattle; PRNP gene
SNPServer is a real-time flexible tool for the discovery of SNPs (single nucleotide polymorphisms) within DNA sequence data. The program uses BLAST, to identify related sequences, and CAP3, to cluster and align these sequences. The alignments are parsed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels). Alternatively, lists of related sequences or pre-assembled sequences may be entered for SNP discovery. SNPServer and autoSNP use redundancy to differentiate between candidate SNPs and sequence errors. For each candidate SNP, two measures of confidence are calculated, the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. SNPServer is available at .
Tocopherols are a class of four natural compounds that can provide nutrition and function as antioxidant in both plants and animals. Maize kernels have low α-tocopherol content, the compound with the highest vitamin E activity, thus, raising the risk of vitamin E deficiency in human populations relying on maize as their primary vitamin E source. In this study, two insertion/deletions (InDels) within a gene encoding γ-tocopherol methyltransferase, Zea mays VTE4 (ZmVTE4), and a single nucleotide polymorphism (SNP) located ∼85 kb upstream of ZmVTE4 were identified to be significantly associated with α-tocopherol levels in maize kernels by conducting an association study with a panel of ∼500 diverse inbred lines. Linkage analysis in three populations that segregated at either one of these three polymorphisms but not at the other two suggested that the three polymorphisms could affect α-tocopherol content independently. Furthermore, we found that haplotypes of the two InDels could explain ∼33% of α-tocopherol variation in the association panel, suggesting ZmVTE4 is a major gene involved in natural phenotypic variation of α-tocopherol. One of the two InDels is located within the promoter region and associates with ZmVTE4 transcript level. This information can not only help in understanding the underlying mechanism of natural tocopherol variations in maize kernels, but also provide valuable markers for marker-assisted breeding of α-tocopherol content in maize kernels, which will then facilitate the improvement of maize as a better source of daily vitamin E nutrition.
The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.
A database for potato genome data (PoMaMo, Potato Maps and More) was established. The database contains molecular maps of all twelve potato chromosomes with about 1000 mapped elements, sequence data, putative gene functions, results from BLAST analysis, SNP and InDel information from different diploid and tetraploid potato genotypes, publication references, links to other public databases like GenBank (http://www.ncbi.nlm.nih.gov/) or SGN (Solanaceae Genomics Network, http://www.sgn.cornell.edu/), etc. Flexible search and data visualization interfaces enable easy access to the data via internet (https://gabi.rzpd.de/PoMaMo.html). The Java servlet tool YAMB (Yet Another Map Browser) was designed to interactively display chromosomal maps. Maps can be zoomed in and out, and detailed information about mapped elements can be obtained by clicking on an element of interest. The GreenCards interface allows a text-based data search by marker-, sequence- or genotype name, by sequence accession number, gene function, BLAST Hit or publication reference. The PoMaMo database is a comprehensive database for different potato genome data, and to date the only database containing SNP and InDel data from diploid and tetraploid potato genotypes.