Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common types of polymorphism and are frequently used for molecular marker development. Such markers have become very popular for all kinds of genetic analysis, including haplotype reconstruction. Haplotypes can be reconstructed for whole chromosomes but also for specific genes, based on the SNPs present. Haplotypes in the latter context represent the different alleles of a gene. The computational approach to SNP mining is becoming increasingly popular because of the continuously increasing number of sequences deposited in databases, which allows a more accurate identification of SNPs. Several software packages have been developed for SNP mining from databases. Of these, QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles, which results in a lower number of false positive SNPs and also works much faster than other programs. We have built a web-based SNP discovery and allele detection tool (HaploSNPer) based on QualitySNP.
HaploSNPer is a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy.
Thorough testing of HaploSNPer (and the underlying QualitySNP algorithm) has shown that EST information alone is sufficient for the identification of alleles and that reliable SNPs can be found efficiently. Furthermore, HaploSNPer supplies a user-friendly interface for visualization of SNPs and alleles. HaploSNPer is available from .
There has recently been great interest in haplotype block structure and haplotype tagging SNPs (htSNPs) in the human genome because of their implications for htSNP-based association mapping strategies for complex diseases. Different definitions have been used to characterize the haplotype block structure in the human genome, and several different performance criteria and algorithms have been suggested for htSNP selection.
A heuristic algorithm, the generalized branch-and-bound algorithm, is applied to search for the minimal set of haplotype tagging SNPs (htSNPs) according to different htSNP performance criteria. We developed the software htSNPer1.0 to implement the algorithm; it integrates three htSNP performance criteria and four haplotype block definitions for haplotype block partitioning. The software has a powerful Graphical User Interface (GUI) and can be used to characterize the haplotype block structure and select htSNPs in candidate genes or genomic regions of interest. It can find the global optimum in only a fraction of the computing time consumed by an exhaustive search algorithm.
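The search for a minimal htSNP set can be illustrated with a small sketch. It is not the htSNPer1.0 implementation: it uses a plain branch-and-bound over SNP subsets, and it assumes a single performance criterion (the chosen SNPs must give every distinct haplotype in a block a unique pattern). The function names and the toy haplotypes are hypothetical.

```python
def distinguishes_all(haplotypes, snps):
    """True if the chosen SNP columns give every distinct haplotype a unique pattern."""
    patterns = {tuple(h[i] for i in snps) for h in haplotypes}
    return len(patterns) == len(set(haplotypes))


def minimal_tag_snps(haplotypes):
    """Branch-and-bound search for a smallest distinguishing SNP set.

    Assumed criterion (not taken from the source): all distinct haplotypes
    must remain distinguishable using only the selected SNP positions.
    """
    n_snps = len(haplotypes[0])
    best = list(range(n_snps))  # trivial upper bound: use every SNP

    def search(next_snp, chosen):
        nonlocal best
        if len(chosen) >= len(best):
            return  # bound: this branch cannot beat the best set found so far
        if distinguishes_all(haplotypes, chosen):
            best = list(chosen)
            return
        for snp in range(next_snp, n_snps):
            search(snp + 1, chosen + [snp])  # branch: add one more SNP

    search(0, [])
    return best


# Toy block of four haplotypes over four biallelic SNPs (illustrative data).
block = ["0000", "0011", "1100", "1111"]
print(minimal_tag_snps(block))  # → [0, 2]
```

In this toy block SNPs 0 and 1 are perfectly correlated, as are SNPs 2 and 3, so one SNP from each pair is enough; the bound prunes every branch that already holds as many SNPs as the best set found.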
htSNPer1.0 allows molecular geneticists to perform haplotype block analysis and htSNP selection using different definitions and performance criteria. The software is a powerful tool for those focusing on association mapping strategies based on haplotype blocks and htSNPs.
Ancestral sequence reconstruction is essential to a variety of evolutionary studies. Here, we present the FastML web server, a user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiple sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences. The utility of FastML is demonstrated by reconstructing ancestral sequences of the Env protein from various HIV-1 subtypes. FastML is freely available for all academic users and is available online at http://fastml.tau.ac.il/.
This paper describes a new program, SnpSift, for filtering differential DNA sequence variants between two or more experimental genomes after genotoxic chemical exposure. Here, we illustrate how SnpSift can be used to identify candidate phenotype-relevant variants including single nucleotide polymorphisms, multiple nucleotide polymorphisms, insertions, and deletions (InDels) in mutant strains isolated from genome-wide chemical mutagenesis of Drosophila melanogaster. First, the genomes of two independently isolated mutant fly strains that are allelic for a novel recessive male-sterile locus generated by genotoxic chemical exposure were sequenced using the Illumina next-generation DNA sequencer to obtain 20- to 29-fold coverage of the euchromatic sequences. The sequencing reads were processed and variants were called using standard bioinformatic tools. Next, SnpEff was used to annotate all sequence variants and their potential mutational effects on associated genes. Then, SnpSift was used to filter and select differential variants that potentially disrupt a common gene in the two allelic mutant strains. The potential causative DNA lesions were partially validated by capillary sequencing of polymerase chain reaction-amplified DNA in the genetic interval as defined by meiotic mapping and deletions that remove defined regions of the chromosome. Of the five candidate genes located in the genetic interval, the Pka-like gene CG12069 was found to carry a separate premature stop codon mutation in each of the two allelic mutants whereas the other four candidate genes within the interval have wild-type sequences. The Pka-like gene is therefore a strong candidate gene for the male-sterile locus. These results demonstrate that combining SnpEff and SnpSift can expedite the identification of candidate phenotype-causative mutations in chemically mutagenized Drosophila strains. This technique can also be used to characterize the variety of mutations generated by genotoxic chemicals.
personal genomes; Drosophila melanogaster; whole-genome SNP analysis; next-generation DNA sequencing
Motivation: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.
Results: In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt and is fastest in detecting complex variants with four or more mismatches or insertions of 1–9 nt and deletions of 1–30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7–8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.
Availability: Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.
Insertion and deletion variants (indels) within poly glycine tracts of fused in sarcoma (FUS) were initially reported as causative of disease in amyotrophic lateral sclerosis (ALS). Subsequent studies identified similar indels in controls and suggested that these indels may confer susceptibility to ALS. We aimed to elucidate the role of previously published and novel exonic indels in FUS in an extensive cohort of 630 ALS patients and 1063 controls. We detected indels in FUS exons 5, 6, 12 and 14 with similar frequencies in patients (0.95%) and controls (0.75%). Exonic indels in poly glycine tracts were also observed with similar frequencies. The largest indel (p.Gly138_Tyr143del) was observed in one control. In one patient, a 3 base pair deletion in exon 14 (p.Gly475del) was identified; however, in vitro studies did not reveal abnormal localization of p.Gly475del mutant FUS. These findings suggest that not all exonic indels in FUS cause disease.
Summary: AnnTools is a versatile bioinformatics application designed for comprehensive annotation of a full spectrum of human genome variation: novel and known single-nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variants/copy number variation (SV/CNV). The variants are interpreted by interrogating data compiled from 15 constantly updated sources. In addition to detailed functional characterization of the coding variants, AnnTools searches for overlaps with regulatory elements, disease/trait associated loci, known segmental duplications and artifact prone regions, thereby offering an integrated and comprehensive analysis of genomic data. The tool conveniently accepts user-provided tracks for custom annotation and offers flexibility in input data formats. The output is generated in the universal Variant Call Format. High annotation speed makes AnnTools suitable for high-throughput sequencing facilities, while a low-memory footprint and modest CPU requirements allow it to operate on a personal computer. The application is freely available for public use; the package includes installation scripts and a set of helper tools.
Supplementary data are available at Bioinformatics online.
Motivation: High-throughput sequencing technologies have made population-scale studies of human genetic variation possible. Accurate and comprehensive detection of DNA sequence variants is crucial for the success of these studies. Small insertions and deletions represent the second most frequent class of variation in the human genome after single nucleotide polymorphisms (SNPs). Although several alignment tools for the gapped alignment of sequence reads to a reference genome are available, computational methods for discriminating indels from sequencing errors and genotyping indels directly from sequence reads are needed.
Results: We describe a probabilistic method for the accurate detection and genotyping of short indels from population-scale sequence data. In this approach, aligned sequence reads from a population of individuals are used to automatically account for context-specific sequencing errors associated with indels. We applied this approach to population sequence datasets from the 1000 Genomes exon pilot project generated using the Roche 454 and Illumina sequencing platforms, and were able to detect a significantly greater number of indels than reported previously. Comparison to indels identified in the 1000 Genomes pilot project demonstrated the sensitivity of our method. The consistency in the number of indels and the fraction of indels whose length is a multiple of three across different human populations and two different sequencing platforms indicated that our method has a low false discovery rate. Finally, the method represents a general approach for the detection and genotyping of small-scale DNA sequence variants for population-scale sequencing projects.
Availability: A program implementing this method is available at http://polymorphism.scripps.edu/~vbansal/software/piCALL/
Supplementary information: Supplementary data are available at Bioinformatics online.
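The consistency check mentioned in the Results above, the fraction of indels whose length is a multiple of three, is simple to compute. The following is a minimal sketch and is not part of the published program; the function name and the example lengths are hypothetical, with insertions given as positive and deletions as negative lengths.

```python
def fraction_multiple_of_three(indel_lengths):
    """Fraction of indels whose absolute length is divisible by three."""
    if not indel_lengths:
        raise ValueError("no indels supplied")
    in_frame = sum(1 for length in indel_lengths if abs(length) % 3 == 0)
    return in_frame / len(indel_lengths)


# Illustrative lengths only: a 3 bp deletion, 1 bp insertion,
# 6 bp insertion and 2 bp deletion.
print(fraction_multiple_of_three([-3, 1, 6, -2]))  # → 0.5
```

Computing this fraction separately per population or per sequencing platform and comparing the values is one way to run the kind of consistency comparison the abstract describes.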
Taking advantage of the deep targeted sequencing capabilities of next generation sequencers, we have developed a novel two-step insertion-deletion (indel) detection algorithm (IDA) that can determine indels from single read sequences with high computational efficiency and sensitivity when reads carrying the indel make up only a small fraction relative to the wild type reference sequence. First, it identifies candidate indel positions utilizing specific sequence alignment artifacts produced by rapid alignment programs. Second, it confirms the location of the candidate indel by using the Smith-Waterman (SW) algorithm on a restricted subset of sequence reads. We demonstrate that IDA is applicable to indels of varying sizes from deep targeted sequencing data at low fractions where the indel is diluted by wild type sequence. Our algorithm is useful in detecting indel variants present at variable allelic frequencies such as may occur in heterozygotes and mixed normal-tumor tissue.
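The second step relies on Smith-Waterman local alignment. The sketch below shows only the scoring recurrence with a linear gap penalty; it is not the IDA implementation, it omits the traceback that would be needed to report the indel position, and the scoring parameters and example sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(read, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (linear gap penalty).

    The default scores are illustrative assumptions, not values from the source.
    """
    previous = [0] * (len(reference) + 1)  # previous row of the score matrix
    best = 0
    for read_base in read:
        current = [0]
        for j, ref_base in enumerate(reference, start=1):
            diagonal = previous[j - 1] + (match if read_base == ref_base else mismatch)
            # Local alignment: scores never drop below zero.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# A read carrying a 2 bp deletion relative to the reference (toy sequences):
# eight matches (+16) and two gap positions (-4).
print(smith_waterman_score("ACGTTGCA", "ACGTAATGCA"))  # → 12
```

Because the full dynamic programme costs time proportional to the product of the two sequence lengths, restricting it to the reads that overlap a candidate position, as the abstract describes, keeps the confirmation step cheap.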
Peach [Prunus persica (L.) Batsch] is an economically important fruit crop that has become a genetic-genomic model for all Prunus species in the family Rosaceae. A doubled haploid reference genome sequence length of 227.3 Mb, a narrow genetic base contrasted by a wide phenotypic variability, the generation of cultivars through hybridization with subsequent clonal propagation, and the current accessibility of many founder genotypes, as well as the pedigree of modern commercial cultivars make peach a model for the study of inter-cultivar genomic heterogeneity and its shaping by artificial selection.
The quantitative genomic differences among the three genotypes, studied as genomic variants, included small variants (SNPs and InDels) and structural variants (SV) (duplications, inversions and translocations). The heirloom cultivar 'Georgia Belle' and an almond by peach introgression breeding line 'F8,1-42' are more heterogeneous than is the modern cultivar 'Dr. Davis' when compared to the peach reference genome ('Lovell'). A pair-wise comparison of consensus genome sequences with 'Lovell' showed that 'F8,1-42' and 'Georgia Belle' were more divergent than were 'Dr. Davis' and 'Lovell'.
A novel application of emerging bioinformatics tools to the analysis of ongoing genome sequencing project outputs has led to the identification of a range of genomic variants. Results can be used to delineate the genomic and phenotypic differences among peach genotypes. For crops such as fruit trees, the availability of old cultivars, breeding selections and their pedigrees makes them suitable models for the study of genome shaping by artificial selection. The findings from the study of such genomic variants can then elucidate the control of pomological traits and the characterization of metabolic pathways, thus facilitating the development of protocols for the improvement of Prunus crops.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-14-750) contains supplementary material, which is available to authorized users.
The baboon is an invaluable model for the study of human health and disease, including many complex diseases of the kidney. Although scientists have made great progress in developing this animal as a model for numerous areas of biomedical research, genomic resources for the baboon, such as a quality annotated genome, are still lacking. To this end, we characterized the baboon kidney transcriptome using high-throughput cDNA sequencing (RNA-Seq) to identify genes, gene variants, single nucleotide polymorphisms (SNPs), insertion-deletion polymorphisms (InDels), cellular functions, and key pathways in the baboon kidney to provide a genomic resource for the baboon. Analysis of our sequencing data revealed 45,499 high-confidence SNPs and 29,813 InDels comparing baboon cDNA sequences with the human hg18 reference assembly and identified 35,900 cDNAs in the baboon kidney, including 35,150 transcripts representing 15,369 genic genes that are novel for the baboon. Gene ontology analysis of our sequencing dataset also identified numerous biological functions and canonical pathways that were significant in the baboon kidney, including a large number of metabolic pathways that support known functions of the kidney. The results presented in this study catalogue the transcribed mRNAs, noncoding RNAs, and hypothetical proteins in the baboon kidney and establish a genomic resource for scientists using the baboon as an experimental model.
This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA’s speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates App development for downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.
Secondary analysis; Whole-genome sequencing; Whole-exome sequencing; GPU; Variant calling; Genomics; NGS; HPC
Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
The Total Integrated Archive of short-Read and Array (TIARA; http://tiara.gmi.ac.kr) database stores and integrates human genome data generated from multiple technologies including next-generation sequencing and high-resolution comparative genomic hybridization array. The TIARA genome browser is a powerful tool for the analysis of personal genomic information by exploring genomic variants such as SNPs, indels and structural variants simultaneously. As of September 2012, the TIARA database provides raw data and variant information for 13 sequenced whole genomes, 16 sequenced transcriptomes and 33 high resolution array assays. Sequencing reads are available at a depth of ∼30× for whole genomes and 50× for transcriptomes. Information on genomic variants includes a total of ∼9.56 million SNPs, 23 025 of which are non-synonymous SNPs, and ∼1.19 million indels. In this update, by adding high coverage sequencing of additional human individuals, the TIARA genome database now provides an extensive record of rare variants in humans. Following TIARA’s fundamentally integrative approach, new transcriptome sequencing data are matched with whole-genome sequencing data in the genome browser. Users can here observe, for example, the expression levels of human genes with allele-specific quantification. Improvements to the TIARA genome browser include the intuitive display of new complex and large-scale data sets.
The resolution of genome-wide association studies (GWAS) is limited by the linkage disequilibrium (LD) structure of the population being studied. Selecting the most likely causal variants within an LD block is relatively straightforward within coding sequence, but is more difficult when all variants are intergenic. Predicting functional non-coding sequence has been recently facilitated by the availability of conservation and epigenomic information. We present HaploReg, a tool for exploring annotations of the non-coding genome among the results of published GWAS or novel sets of variants. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state in nine cell types, conservation across mammals and their effect on regulatory motifs. Sets of SNPs, such as those resulting from GWAS, are analyzed for an enrichment of cell type-specific enhancers. HaploReg will be useful to researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. The HaploReg database is available at http://compbio.mit.edu/HaploReg.
The population of the State of Kuwait is composed of three genetic subgroups of inferred Persian, Saudi Arabian tribe and Bedouin ancestry. The Saudi Arabian tribe subgroup traces its origin to the Najd region of Saudi Arabia. By sequencing two whole genomes and thirteen exomes from this subgroup at high coverage (>40X), we identify 4,950,724 Single Nucleotide Polymorphisms (SNPs), 515,802 indels and 39,762 structural variations. Of the identified variants, 10,098 (8.3%) exomic SNPs, 139,923 (2.9%) non-exomic SNPs, 5,256 (54.3%) exomic indels, and 374,959 (74.08%) non-exomic indels are ‘novel’. Up to 8,070 (79.9%) of the reported novel biallelic exomic SNPs are seen in low frequency (minor allele frequency <5%). We observe 5,462 known and 1,004 novel potentially deleterious nonsynonymous SNPs. Allele frequencies of common SNPs from the 15 exomes are significantly correlated with those from genotype data of a larger cohort of 48 individuals (Pearson correlation coefficient, 0.91; p < 2.2×10^-16). A set of 2,485 SNPs show significantly different allele frequencies when compared to populations from other continents. Two notable variants having risk alleles in high frequencies in this subgroup are: a nonsynonymous deleterious SNP (rs2108622 [19:g.15990431C>T] from CYP4F2 gene [MIM:*604426]) associated with warfarin dosage levels [MIM:#122700] required to elicit normal anticoagulant response; and a 3′ UTR SNP (rs6151429 [22:g.51063477T>C] from ARSA gene [MIM:*607574]) associated with Metachromatic Leukodystrophy [MIM:#250100]. The Hemoglobin Riyadh variant (identified for the first time in a Saudi Arabian woman) is observed in the exome data. The mitochondrial haplogroup profiles of the 15 individuals are consistent with the haplogroup diversity seen in Saudi Arabian natives, who are believed to have received substantial gene flow from Africa and eastern provenance.
We present the first genome resource imperative for designing future genetic studies in the Saudi Arabian tribe subgroup. The full-length genome sequences and the identified variants are available at ftp://dgr.dasmaninstitute.org and http://dgr.dasmaninstitute.org/DGR/gb.html.
The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51 (FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160 kb region encompassing FKBP5. We discovered 613 SNPs and 57 indels, including a 3.3 kb deletion. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore, we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5 hours to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression.
Finally, we show how our discovery of rare variants may change current conceptions of evolution at this locus.
pattern recognition; next-generation sequencing analysis; indels; rare variants; FKBP5; HLA
The ABO blood groups result from DNA sequence variations, predominantly single nucleotide and insertion/deletion polymorphisms (SNPs and indels), in the ABO gene encoding a glycosyltransferase. The ABO blood groups A1, A2, B and O predominantly result from the wild type allele A1 and the major gene variants that are characterized by four diallelic markers (261G>del, 802G>A, 803G>C, 1061C>del). Here, we aimed to evaluate the impact of ABO genotyping compared to ABO phenotyping in paternity testing.
The major ABO alleles were determined by PCR amplification with sequence-specific primers (PCR-SSP) in a representative sample of 1,335 blood donors. The genotypes were compared to the ABO blood groups registered in the blood donor files. Then, the ABO phenotypes and genotypes were determined in 95 paternity trio cases that had previously been investigated with 12 short tandem repeat (STR) markers. We compared statistical parameters (PL, paternity likelihood; PE, power of exclusion) of both blood grouping approaches.
The prevalence of the major ABO alleles and genotypes corresponded to the expected occurrence of ABO blood groups in a Caucasian population. The low resolution genotyping of 4 diallelic markers revealed a correct genotype-phenotype correlation in 1,331 of 1,335 samples (99.7%). In 60 paternity trios with confirmed paternity of the alleged father based on STR analysis, both PL and PE of the ABO genotype were significantly higher than those of the ABO phenotype. In 12 of 35 exclusion cases (34.3%) the ABO genotype also excluded the alleged father, whereas the ABO phenotype excluded the alleged father only in 7 cases (20%).
In paternity testing, ABO genotyping is superior to ABO phenotyping with regard to PL and PE; however, ABO genotyping is not sufficient for valid paternity testing. Due to the much lower mutation rate compared to STR markers, blood group SNPs in addition to anonymous SNPs could be considered for future kinship analysis and genetic identity testing.
ABO genotyping; Paternity likelihood; Power of exclusion; Paternity trios
With ever-increasing numbers of microbial genomes being sequenced, efficient tools are needed to perform strain-level identification of any newly sequenced genome. Here, we present the SNP identification for strain typing (SNIT) pipeline, a fast and accurate software system that compares a newly sequenced bacterial genome with other genomes of the same species to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Based on this information, the pipeline analyzes the polymorphic loci present in all input genomes to identify the genome that has the fewest differences with the newly sequenced genome. Similarly, for each of the other genomes, SNIT identifies the input genome with the fewest differences. Results from five bacterial species show that the SNIT pipeline identifies the correct closest neighbor with 75% to 100% accuracy. The SNIT pipeline is available for download at http://www.bhsai.org/snit.html
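The closest-neighbor step described above can be sketched as follows. This is not the SNIT code: it assumes the SNP and indel calls have already been reduced to one allele per polymorphic locus per genome, that every genome is scored at the same loci, and the names and data are hypothetical.

```python
def count_differences(profile_a, profile_b):
    """Number of polymorphic loci at which two genomes carry different alleles."""
    return sum(1 for locus in profile_a if profile_a[locus] != profile_b[locus])


def closest_genome(query, references):
    """Name of the reference genome with the fewest differences from the query."""
    return min(references, key=lambda name: count_differences(query, references[name]))


# Hypothetical allele calls keyed by locus position ("-" marks a deletion).
new_genome = {101: "A", 250: "G", 777: "-"}
known = {
    "strain1": {101: "A", 250: "T", 777: "-"},
    "strain2": {101: "C", 250: "T", 777: "AT"},
}
print(closest_genome(new_genome, known))  # → strain1
```

Repeating the same call with each known genome as the query, and the remaining genomes as references, gives the per-genome nearest neighbors that the abstract also mentions.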
The DECIPHER database (https://decipher.sanger.ac.uk/) is an accessible online repository of genetic variation with associated phenotypes that facilitates the identification and interpretation of pathogenic genetic variation in patients with rare disorders. Contributing to DECIPHER is an international consortium of >200 academic clinical centres of genetic medicine and ≥1600 clinical geneticists and diagnostic laboratory scientists. Information integrated from a variety of bioinformatics resources, coupled with visualization tools, provides a comprehensive set of tools to identify other patients with similar genotype–phenotype characteristics and highlights potentially pathogenic genes. In a significant development, we have extended DECIPHER from a database of just copy-number variants to allow upload, annotation and analysis of sequence variants such as single nucleotide variants (SNVs) and InDels. Other notable developments in DECIPHER include a purpose-built, customizable and interactive genome browser to aid combined visualization and interpretation of sequence and copy-number variation against informative datasets of pathogenic and population variation. We have also introduced several new features to our deposition and analysis interface. This article provides an update to the DECIPHER database, an earlier instance of which has been described elsewhere [Swaminathan et al. (2012) DECIPHER: web-based, community resource for clinical interpretation of rare variants in developmental disorders. Hum. Mol. Genet., 21, R37–R44].
Several ways of incorporating indels into phylogenetic analysis have been suggested. Simple indel coding has two strengths: (1) biological realism and (2) efficiency of analysis. In the method, each indel with different start and/or end positions is considered to be a separate character. The presence/absence of these indel characters is then added to the data set.
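The coding rule just described can be made concrete with a minimal sketch. It is not GapCoder itself: it reads the alignment from a plain dictionary rather than a PIR file, it codes only presence or absence, and it omits any special treatment of overlapping or terminal gaps (for example, scoring a character as missing when a longer gap spans it). The function names and the toy alignment are hypothetical.

```python
import re


def gap_intervals(sequence):
    """Half-open (start, end) intervals of every run of '-' in an aligned sequence."""
    return [match.span() for match in re.finditer(r"-+", sequence)]


def simple_indel_coding(alignment):
    """Return the indel characters and a presence/absence string per taxon.

    Each gap with a distinct start and/or end position is its own character.
    """
    per_taxon = {name: set(gap_intervals(seq)) for name, seq in alignment.items()}
    characters = sorted(set().union(*per_taxon.values()))
    matrix = {
        name: "".join("1" if interval in gaps else "0" for interval in characters)
        for name, gaps in per_taxon.items()
    }
    return characters, matrix


# Toy alignment: two gaps share an end position but differ in their start.
alignment = {
    "taxonA": "ACG--TAC",
    "taxonB": "ACGGGTAC",
    "taxonC": "AC---TAC",
}
characters, matrix = simple_indel_coding(alignment)
print(characters)  # → [(2, 5), (3, 5)]
print(matrix)      # → {'taxonA': '01', 'taxonB': '00', 'taxonC': '10'}
```

The returned list of intervals plays the role of a lookup table from each indel character back to its alignment region, so characters that fall in an excluded region are easy to identify.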
We have written a program, GapCoder, to automate this procedure. The program can input PIR format aligned datasets, find the indels and add the indel-based characters. The output is a NEXUS format file, which includes a table showing what region each indel character is based on. If regions are excluded from analysis, this table makes it easy to identify the corresponding indel characters for exclusion.
Manual implementation of the simple indel coding method can be very time-consuming, especially in data sets where indels are numerous and/or overlapping. GapCoder automates this method and is therefore particularly useful during procedures where phylogenetic analyses need to be repeated many times, such as when different alignments are being explored or when various taxon or character sets are being explored. GapCoder is currently available for Windows from .
Primary ciliary dyskinesia (PCD) is a genetic disorder, usually autosomal recessive, causing early respiratory disease and later subfertility. Whole exome sequencing may enable efficient analysis for locus heterogeneous disorders such as PCD. We performed whole exome sequencing of one Saudi Arabian individual of consanguineous parentage, with clinically diagnosed PCD and normal laterality, to attempt ab initio molecular diagnosis.
We reviewed thirteen known PCD genes and potentially autozygous regions (extended homozygosity) for homozygous exon deletions, non-dbSNP codon, splice-site base variants or small indels. Homozygous non-dbSNP changes were also reviewed exome-wide.
One single molecular read representing RSPH9 p.Lys268del was observed, with no wildtype reads, and a notable deficiency of mapped reads at this location. Among all observations, RSPH9 was the strongest candidate for causality. Searching unmapped reads revealed seven more mutant reads. Direct assay for p.Lys268del (MboII digest) confirmed homozygosity in the affected individual, then confirmed homozygosity in three siblings with bronchiectasis. Our finding in southwest Saudi Arabia indicates that p.Lys268del, previously observed in two Bedouin families (Israel, UAE) is geographically widespread in the Arabian Peninsula. Analogous with cystic fibrosis CFTR p.Phe508del, screening for RSPH9 p.Lys268del (which lacks sentinel dextrocardia) in those at risk would help in early diagnosis, tailored clinical management, genetic counselling and primary prevention.
high-throughput nucleotide sequencing; primary ciliary dyskinesia; screening
Next generation sequence data provides valuable information and tools for genetic and genomic research and offers new insights useful for marker development. This data is useful for the design of accurate and user-friendly molecular tools. Common bean (Phaseolus vulgaris L.) is a diverse crop in which separate domestication events happened in each gene pool, followed by race and market class diversification that has resulted in different morphological characteristics in each commercial market class. This has led to essentially independent breeding programs within each market class, which in turn has resulted in limited within-market-class sequence variation. Sequence data from selected genotypes of five bean market classes (pinto, black, navy, and light and dark red kidney) were used to develop InDel-based markers specific to each market class. Design of the InDel markers was conducted through a combination of assembly, alignment and primer design software using 1.6× to 5.1× coverage of Illumina GAII sequence data for each of the selected genotypes. The procedure we developed for primer design is fast and accurate, and it is less error prone and higher throughput than manual primer design. All InDel markers are easy to run and score with no need for PCR optimization. A total of 2687 InDel markers distributed across the genome were developed. To highlight their usefulness, they were employed to construct a phylogenetic tree and a genetic map, showing that InDel markers are reliable, simple, and accurate.
InDel marker; next generation sequencing; market class; phylogenetics; genetic map; common bean
Indels are an important cause of human variation and central to the study of human disease. The 1000 Genomes Project Low-Coverage Pilot identified over 1.3 million indels shorter than 50 bp, of which over 890 were identified as potentially disruptive variants. Yet, despite their ubiquity, the local genomic characteristics of indels remain unexplored.
Herein we describe population- and minor allele frequency-based differences in linkage disequilibrium and imputation characteristics for indels included in the 1000 Genomes Project Low-Coverage Pilot for the CEU, YRI and CHB+JPT populations. Common indels were well tagged by nearby SNPs in all studied populations, and were also tagged at a similar rate to common SNPs. Both neutral and functionally deleterious common indels were imputed with greater than 95% concordance from HapMap Phase 3 and OMNI SNP sites. Further, 38 to 56% of low frequency indels were tagged by low frequency SNPs. We were able to impute heterozygous low frequency indels with over 50% concordance. Lastly, our analysis also revealed evidence of ascertainment bias. This bias prevents us from extending the applicability of our results to highly polymorphic indels that could not be identified in the Low-Coverage Pilot.
Although further scope exists to improve the imputation of low frequency indels, our study demonstrates that there are already ample opportunities to retrospectively impute indels for prior genome-wide association studies and to incorporate indel imputation into future case/control studies.
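The tagging results above rest on pairwise linkage disequilibrium between an indel and nearby SNPs. As a minimal sketch, the r² statistic can be computed from phased haplotypes coded as 0/1 alleles. The function names (`r_squared`, `best_tag`), the toy haplotypes and the conventional r² ≥ 0.8 tagging threshold are assumptions for illustration. They are not taken from the study, which does not state its threshold here.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic sites from phased 0/1 haplotypes."""
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                   # undefined when either site is monomorphic
    d = p_ab - p_a * p_b                      # coefficient of linkage disequilibrium D
    return d * d / denom


def best_tag(indel_hap, snp_haps):
    """Return (snp_id, r2) for the SNP in strongest LD with the indel."""
    scored = {snp_id: r_squared(indel_hap, hap) for snp_id, hap in snp_haps.items()}
    scored = {k: v for k, v in scored.items() if v == v}  # drop NaN entries
    if not scored:
        return None, float("nan")
    snp_id = max(scored, key=scored.get)
    return snp_id, scored[snp_id]


# Hypothetical toy data: eight phased haplotypes, 1 = minor allele.
indel = [1, 1, 0, 0, 1, 0, 0, 0]
snps = {
    "snp1": [1, 1, 0, 0, 1, 0, 0, 0],   # perfectly correlated with the indel
    "snp2": [1, 0, 0, 0, 1, 0, 1, 0],   # partially correlated
}
tag_id, tag_r2 = best_tag(indel, snps)
is_tagged = tag_r2 >= 0.8                # assumed threshold, not from the study
```

With unphased genotype data, haplotype frequencies would first have to be estimated (for example by an EM procedure), which this sketch does not attempt.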
A high-throughput method for genotyping by mapping InDels. This method has been used to create fragment-length polymorphism maps for Drosophila and C. elegans.
Small insertions or deletions (InDels) constitute a ubiquitous class of sequence polymorphisms found in eukaryotic genomes. Here, we present an automated high-throughput genotyping method that relies on the detection of fragment-length polymorphisms (FLPs) caused by InDels. The protocol utilizes standard sequencers and genotyping software. We have established genome-wide FLP maps for both Caenorhabditis elegans and Drosophila melanogaster that facilitate genetic mapping with a minimum of manual input and at comparatively low cost.