High-throughput sequencing of targeted genomic loci in large populations is an effective approach for evaluating the contribution of rare variants to disease risk. We evaluated the feasibility of using in-solution hybridization-based target capture on pooled DNA samples to enable cost-efficient population sequencing studies. For this, we performed pooled sequencing of 100 HapMap samples across ∼600 kb of DNA sequence using the Illumina GAIIx. Using our accurate variant calling method for pooled sequence data, we were able to not only identify single nucleotide variants with a low false discovery rate (<1%) but also accurately detect short insertion/deletion variants. In addition, with sufficient coverage per individual in each pool (30-fold) we detected 97.2% of the total variants and 93.6% of variants below 5% in frequency. Finally, allele frequencies for single nucleotide variants (SNVs) estimated from the pooled data and the HapMap genotype data were tightly correlated (correlation coefficient ≥ 0.995).
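The tight allele-frequency correlation reported above rests on a simple fact: at high coverage, the fraction of alternate-allele reads in a pool directly estimates the allele frequency. A minimal sketch, where the function name and the rounding heuristic (snapping to a feasible chromosome count in the pool) are illustrative assumptions, not the authors' method:

```python
def pooled_allele_frequency(alt_reads, total_reads, pool_size):
    """Estimate the alternate-allele frequency in a DNA pool from read counts.

    The raw read fraction estimates the allele frequency directly; rounding
    to the nearest multiple of 1/(2 * pool_size) snaps it to a number of
    carrier chromosomes that is actually possible in a pool of diploids.
    """
    raw = alt_reads / total_reads
    chromosomes = 2 * pool_size            # diploid individuals
    snapped = round(raw * chromosomes) / chromosomes
    return raw, snapped

# A pool of 50 diploid individuals (100 chromosomes) at high depth:
raw, snapped = pooled_allele_frequency(alt_reads=310, total_reads=6200, pool_size=50)
```

With 310 alternate reads out of 6200, both estimates give a 5% allele frequency, i.e. 5 of the 100 chromosomes in the pool.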
Here we demonstrate a method for unbiased multiplexed deep sequencing of RNA and DNA libraries using a novel, efficient and adaptable barcoding strategy called Post-Amplification Ligation-Mediated (PALM) barcoding. PALM barcoding is performed as the very last step of library preparation, eliminating potential barcode-induced bias and allowing the flexibility to synthesize as many barcodes as needed. We sequenced PALM-barcoded microRNA (miRNA) and DNA reference samples and evaluated the quantitative barcode-induced bias in comparison to the same reference samples prepared using the Illumina TruSeq barcoding strategy. The Illumina TruSeq small RNA strategy introduces the barcode during the PCR step using differentially barcoded primers, while the TruSeq DNA strategy introduces the barcode before the PCR step by ligation of differentially barcoded adaptors. Results show virtually no bias between the differentially barcoded miRNA and DNA samples for either the PALM or the TruSeq sample preparation methods. We also multiplexed miRNA reference samples using pre-PCR barcode ligation; this barcoding strategy resulted in significant bias.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
The diversity and scope of multiplex parallel sequencing applications is steadily increasing. Critically, these methods rely on barcoded primers for sample identification, and the quality of the barcodes directly impacts the quality of the resulting sequence data. Inspection of recent publications reveals surprisingly variable quality in the barcodes employed. Some barcodes are designed in a semi-empirical fashion, without quantitative consideration of error correction or minimum-distance properties. After a systematic comparison of published barcode sets, including commercially distributed barcoded primers from Illumina and Epicentre, methods for improved, Hamming-code-based barcode sequences are suggested and illustrated. Hamming barcodes can be employed for DNA tag design in many different ways while preserving minimum-distance and error-correcting properties. In addition, Hamming barcodes remain flexible with regard to essential biological parameters such as sequence redundancy and GC content. Wider adoption of improved Hamming barcodes in multiplex parallel sequencing applications is encouraged.
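The minimum-distance property that separates principled barcodes from semi-empirical ones can be illustrated with a greedy screen: accept a candidate only if it stays at Hamming distance of at least 3 from every accepted barcode, since distance 3 allows correction of any single sequencing error. True Hamming codes reach larger sets algebraically; this sketch only illustrates the screening criterion, and the barcode length and GC bounds are assumed values:

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_barcodes(length=4, min_dist=3, min_gc=0.25, max_gc=0.75):
    """Greedily collect DNA barcodes pairwise separated by >= min_dist.

    A minimum Hamming distance of 3 corrects any single sequencing error;
    the GC filter keeps melting behaviour across barcodes comparable.
    """
    chosen = []
    for cand in product("ACGT", repeat=length):
        cand = "".join(cand)
        gc = (cand.count("G") + cand.count("C")) / length
        if not (min_gc <= gc <= max_gc):
            continue
        if all(hamming(cand, b) >= min_dist for b in chosen):
            chosen.append(cand)
    return chosen

codes = greedy_barcodes()
```

Every pair in the returned set differs in at least three positions, so a single miscalled base still decodes unambiguously to the nearest barcode.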
The variation resources within the University of California Santa Cruz Genome Browser include polymorphism data drawn from public collections and analyses of these data, along with their display in the context of other genomic annotations. Primary data from dbSNP are included for many organisms, with added information including genomic alleles and orthologous alleles for closely related organisms. Display filtering and coloring are available by variant type, functional class or other annotations. Annotation of potential errors is highlighted and a genomic alignment of the variant's flanking sequence is displayed. HapMap allele frequencies and linkage disequilibrium (LD) are available for each HapMap population, along with non-human primate alleles. The browsing and analysis tools, downloadable data files and links to documentation and other information can be found at .
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing.
Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP.
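Approach (ii) above, evaluating whether sequencing errors alone could produce the observed non-reference base calls, can be illustrated with a binomial tail probability. This is a simplified sketch rather than CRISP's actual statistic, and the error rate used here is an assumed constant:

```python
from math import comb

def p_value_error_only(alt, depth, error_rate=0.005):
    """Probability of observing >= alt non-reference calls out of depth reads
    if every mismatch were an independent sequencing error (binomial tail).

    A small value means errors alone are unlikely to explain the alternate
    calls, supporting a real variant at the site.
    """
    return sum(comb(depth, k) * error_rate**k * (1 - error_rate)**(depth - k)
               for k in range(alt, depth + 1))
```

At 100x depth a single alternate call is unremarkable (the tail probability is roughly 0.4), whereas ten alternate calls are essentially impossible under errors alone, which is why pooled callers can separate rare variants from noise.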
Availability: Implementation of this method is available at http://polymorphism.scripps.edu/∼vbansal/software/CRISP/
Targeted sequencing is a cost-efficient way to obtain answers to biological questions in many projects, but the choice of the enrichment method to use can be difficult. In this study we compared two hybridization methods for target enrichment for massively parallel sequencing and single nucleotide polymorphism (SNP) discovery, namely Nimblegen sequence capture arrays and the SureSelect liquid-based hybrid capture system. We prepared sequencing libraries from three HapMap samples using both methods, sequenced the libraries on the Illumina Genome Analyzer, mapped the sequencing reads back to the genome, and called variants in the sequences. 74–75% of the sequence reads originated from the targeted region in the SureSelect libraries and 41–67% in the Nimblegen libraries. We could sequence up to 99.9% and 99.5% of the regions targeted by capture probes from the SureSelect libraries and from the Nimblegen libraries, respectively. The Nimblegen probes covered 0.6 Mb more of the original 3.1 Mb target region than the SureSelect probes. In each sample, we called more SNPs and detected more novel SNPs from the libraries that were prepared using the Nimblegen method. Thus the Nimblegen method gave better results when judged by the number of SNPs called, but this came at the cost of more over-sampling.
The International HapMap Project provides a key resource of genotypic data on human samples, including lymphoblastoid cell lines derived from individuals of four major world populations of African, European, Japanese and Chinese ancestry. Researchers have utilized this resource to identify genetic elements that correlate with various phenotypes such as risks of common diseases, individual drug response and gene expression variation. However, recent comparative studies have suggested that the currently available HapMap genotypic data may not capture a substantial proportion of rare or untyped SNPs in these populations, implying that the HapMap SNPs may not be sufficient for comprehensive association studies. In this paper, three large-scale deep resequencing projects covering the HapMap samples are discussed: ENCODE (Encyclopedia of DNA Elements), SeattleSNPs and the NIEHS (National Institute of Environmental Health Sciences) Environmental Genome Project. Prospectively, once integrated with the HapMap resource, these efforts will greatly benefit the next wave of association studies and data mining using these cell lines.
HapMap; lymphoblastoid cell lines; genotype; single nucleotide polymorphism; resequencing
Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions).
We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP calling with a new, faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per HiSeq2000 exome sample and detected ~5% more SNPs than the conventional whole-genome approach, suggesting that combining both approaches may uncover more novel SNPs than the conventional approach alone.
We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.
Two-stage mapping; Read-backmapping; Software performance; SNP discovery; Multiplexed targeted next-generation sequencing
Novel methods for targeted sequencing of unique regions from complex eukaryotic genomes have generated a great deal of excitement, but critical demonstrations of these methods' efficacy with respect to diploid genotype calling and experimental variation are lacking. To address this issue, we optimized microarray-based genomic selection (MGS) for use with the Illumina Genome Analyzer (IGA). A set of 202 fragments (304 kb total) contained within a 1.7-Mb genomic region on human chromosome X was MGS/IGA sequenced in ten female HapMap samples, generating a total of 2.4 GB of DNA sequence. At a minimum coverage threshold of 5X, 93.9% of all bases and 94.9% of segregating sites were called, while 57.7% of bases (57.4% of segregating sites) were called at a 50X threshold. Data accuracy at known segregating sites was 98.9% at 5X coverage, rising to 99.6% at 50X coverage. Accuracy at homozygous sites was 98.7% at 5X sequence coverage and 99.5% at 50X coverage. Although accuracy at heterozygous sites was modestly lower, it was still over 92% at 5X coverage and increased to nearly 97% at 50X coverage. These data provide the first demonstration that MGS/IGA sequencing can generate the very high quality sequence data necessary for human genetics research.
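The coverage thresholds and the lower heterozygote accuracy above follow from how diploid genotypes are called from allele read counts: at low depth, a heterozygote can easily show a skewed allele fraction by chance. A naive sketch, where the depth threshold and allele-fraction bands are illustrative assumptions rather than the authors' pipeline:

```python
def call_genotype(ref_reads, alt_reads, min_depth=5, het_band=(0.2, 0.8)):
    """Naive diploid genotype call from allele read counts.

    No call is made below min_depth; a site is called heterozygous when the
    alternate-allele fraction falls inside het_band, and homozygous otherwise.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None                     # insufficient coverage to call
    frac = alt_reads / depth
    if frac < het_band[0]:
        return "hom-ref"
    if frac > het_band[1]:
        return "hom-alt"
    return "het"
```

At 5X, a true heterozygote yields 0 or 5 alternate reads about 6% of the time by binomial chance and would be miscalled homozygous, which is why heterozygote accuracy climbs with depth.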
All sequences generated in this study have been deposited in the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra, Accession # SRA007913).
Personal Genomes; Direct Selection; Microarray-based Genomic Selection; Illumina Genome Analyzer; Targeted Sequencing; Human Genetics
Advances in automated DNA sequencing technology have greatly increased the scale of genomic and metagenomic studies. An increasingly popular way to raise project throughput is to multiplex samples during the sequencing phase. This can be achieved by covalently linking short, unique "barcode" DNA segments to genomic DNA samples, for instance through incorporation of barcode sequences in PCR primers. Although several strategies have been described to ensure that barcode sequences are unique and robust to sequencing errors, these have not been integrated into the overall primer design process, thus potentially introducing bias into PCR amplification and/or sequencing steps.
Barcrawl is a software program that facilitates the design of barcoded primers for multiplexed high-throughput sequencing. The companion program bartab can be used to deconvolute DNA sequence datasets produced with multiple barcoded primers. This paper describes the functions implemented by barcrawl and bartab and presents a proof-of-concept case study of both programs in which barcoded rRNA primers were designed and validated by high-throughput sequencing.
Barcrawl and bartab can benefit researchers who are engaged in metagenomic projects that employ multiplexed specimen processing. The source code is released under the GNU general public license and can be accessed at .
The characterization of bacterial communities using DNA sequencing has revolutionized our ability to study microbes in nature and discover the ways in which microbial communities affect ecosystem functioning and human health. Here we describe Serial Illumina Sequencing (SI-Seq): a method for deep sequencing of the bacterial 16S rRNA gene using next-generation sequencing technology. SI-Seq serially sequences portions of the V5, V6 and V7 hypervariable regions from barcoded 16S rRNA amplicons using an Illumina short-read genome analyzer. SI-Seq obtains taxonomic resolution similar to 454 pyrosequencing for a fraction of the cost, and can produce hundreds of thousands of reads per sample even with very high multiplexing. We validated SI-Seq using single species and mock community controls, and via a comparison to cystic fibrosis lung microbiota sequenced using 454 FLX Titanium. Our control runs show that SI-Seq has a dynamic range of at least five orders of magnitude, can classify >96% of sequences to the genus level, and performs just as well as 454 and paired-end Illumina methods in estimation of standard microbial ecology diversity measurements. We illustrate the utility of SI-Seq in a pilot sample of central airway secretion samples from cystic fibrosis patients.
This study reports results of an extensive and comprehensive survey of genetic diversity in 12 genes of the innate immune system in a population of eastern India. Genomic variation was assayed in 171 individuals by resequencing ~75 kb of DNA comprising these genes in each individual. Almost half of the 548 DNA variants discovered were novel. DNA sequence comparisons with human and chimpanzee reference sequences revealed evolutionary features indicative of natural selection operating among these individuals, who are residents of an area with a high load of microbial and other pathogens. Allele and haplotype frequencies of the study population differed significantly from those of the HapMap populations. Gene and haplotype diversities were high. The genetic positioning of the study population among the HapMap populations based on data from the innate immunity genes differed substantially from what has been observed for Indian populations based on data from other genes. The reported range of variation in SNP density in the human genome is one SNP per 1.19 kb (chromosome 22) to one SNP per 2.18 kb (chromosome 19). The SNP density in innate immunity genes observed in this study (>3 SNPs kb−1) exceeds the highest density observed for any autosomal chromosome in the human genome. The extensive genomic variation and the distinct haplotype structure of innate immunity genes observed among individuals have possibly resulted from the impact of natural selection.
Host; Pathogen; Evolution; DNA resequencing; Single nucleotide polymorphism; Haplotype; Genome diversity
Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource (http://oligogenome.stanford.edu/). This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.
High throughput sequencing is frequently used to discover the location of regulatory interactions on chromatin. However, techniques that enrich DNA where regulatory activity takes place, such as chromatin immunoprecipitation (ChIP), often yield less DNA than optimal for sequencing library preparation. Existing protocols for picogram-scale libraries require concomitant fragmentation of DNA, pre-amplification, or long overnight steps.
We report a simple and fast library construction method that produces libraries from sub-nanogram quantities of DNA. This protocol yields conventional libraries with barcodes suitable for multiplexed sample analysis on the Illumina platform. We demonstrate the utility of this method by constructing a ChIP-seq library from 100 pg of ChIP DNA; this library shows genomic coverage of target regions equivalent to that of a library produced from a larger-scale experiment.
Application of this method allows whole genome studies from samples where material or yields are limiting.
Illumina; ChIP-seq; Multiplex; Barcoding; Library preparation
We have developed a new method using the Qbead™ system for high-throughput genotyping of single nucleotide polymorphisms (SNPs). The Qbead system employs fluorescent Qdot™ semiconductor nanocrystals, also known as quantum dots, to encode microspheres that subsequently can be used as a platform for multiplexed assays. By combining mixtures of quantum dots with distinct emission wavelengths and intensities, unique spectral ‘barcodes’ are created that enable the high levels of multiplexing required for complex genetic analyses. Here, we applied the Qbead system to SNP genotyping by encoding microspheres conjugated to allele-specific oligonucleotides. After hybridization of oligonucleotides to amplicons produced by multiplexed PCR of genomic DNA, individual microspheres are analyzed by flow cytometry and each SNP is distinguished by its unique spectral barcode. Using 10 model SNPs, we validated the Qbead system as an accurate and reliable technique for multiplexed SNP genotyping. By modifying the types of probes conjugated to microspheres, the Qbead system can easily be adapted to other assay chemistries for SNP genotyping as well as to other applications such as analysis of gene expression and protein–protein interactions. With its capability for high-throughput automation, the Qbead system has the potential to be a robust and cost-effective platform for a number of applications.
Highly multiplexed DNA sequencers have greatly expanded our ability to survey human genomes for previously unknown single nucleotide polymorphisms (SNPs). However, sequencing and mapping errors, though rare, contribute substantially to the number of false discoveries made by current SNP callers. We demonstrate that pooling information across samples can significantly reduce the number of false positive SNP calls. Although many studies prepare and sequence multiple samples with the same protocol, most existing SNP callers ignore cross-sample information. In contrast, we propose an empirical Bayes method that uses cross-sample information to learn the error properties of the data. This error information lets us call SNPs with a lower false discovery rate than existing methods.
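The core idea, learning error properties by pooling information across samples, can be sketched as follows. This is a deliberately simplified stand-in for the paper's empirical Bayes model: the background rate is a plain pooled estimate and the per-sample test a binomial tail, with all names illustrative:

```python
from math import comb

def cross_sample_snp_scores(alt_counts, depths):
    """Score candidate SNPs at one site by learning the error rate from all samples.

    The background mismatch rate is estimated by pooling every sample's counts
    (a crude shared prior in the spirit of empirical Bayes); each sample is then
    tested against that learned rate with a binomial tail probability.  Samples
    whose alternate fraction greatly exceeds the shared background get tiny
    scores and are the likely true variant carriers.
    """
    bg = sum(alt_counts) / sum(depths)          # shared error-rate estimate
    def tail(a, d):                             # P(>= a alt calls | rate bg)
        return sum(comb(d, k) * bg**k * (1 - bg)**(d - k) for k in range(a, d + 1))
    return [tail(a, d) for a, d in zip(alt_counts, depths)]
```

With nine samples showing roughly one mismatch per hundred reads and one sample showing 45, the shared estimate stays low and only the outlier sample receives a vanishing tail probability; a caller that treated each sample in isolation would have to guess the error rate instead of learning it.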
Next-generation DNA sequencing is opening new avenues for genetic association studies in common diseases that, like deep vein thrombosis (DVT), have a strong genetic predisposition still largely unexplained by currently identified risk variants. In order to develop sequencing and analytical pipelines for the application of next-generation sequencing to complex diseases, we conducted a pilot study sequencing the coding area of 186 hemostatic/proinflammatory genes in 10 Italian cases of idiopathic DVT and 12 healthy controls.
A molecular-barcoding strategy was used to multiplex DNA target capture and sequencing, while retaining individual sequence information. Genomic libraries with barcode sequence-tags were pooled (in pools of 8 or 16 samples) and enriched for target DNA sequences. Sequencing was performed on ABI SOLiD-4 platforms. We produced > 12 gigabases of raw sequence data to sequence at high coverage (average: 42X) the 700-kilobase target area in 22 individuals. A total of 1876 high-quality genetic variants were identified (1778 single nucleotide substitutions and 98 insertions/deletions). Annotation on databases of genetic variation and human disease mutations revealed several novel, potentially deleterious mutations. We tested 576 common variants in a case-control association analysis, carrying the top-5 associations over to replication in up to 719 DVT cases and 719 controls. We also conducted an analysis of the burden of nonsynonymous variants in coagulation factor and anticoagulant genes. We found an excess of rare missense mutations in anticoagulant genes in DVT cases compared to controls and an association for a missense polymorphism of FGA (rs6050; p = 1.9 × 10⁻⁵, OR 1.45; 95% CI, 1.22-1.72; after replication in > 1400 individuals).
We implemented a barcode-based strategy to efficiently multiplex sequencing of hundreds of candidate genes in several individuals. In the relatively small dataset of our pilot study we were able to identify bona fide associations with DVT. Our study illustrates the potential of next-generation sequencing for the discovery of genetic variation predisposing to complex diseases.
Deep vein thrombosis; venous thromboembolism; next-generation sequencing; target capture; multiplexing; FGA; rs6025; haemostateome; DVT; VTE
With the recent growth of information on sequence variations in the human genome, predictions regarding the functional effects and disease relevance of coding sequence variations are becoming increasingly important. The aims of this study were to catalog protein-coding sequence variations (CVs) occurring in genetic variation databases and to analyze CVs using bioinformatic programs. In addition, we aimed to provide insight into the functionality of the reference databases.
Methodology and Findings
To catalog CVs on a genome-wide scale with regard to protein function and disease, we investigated three representative databases: the Human Gene Mutation Database (HGMD), the Single Nucleotide Polymorphisms database (dbSNP) and the Haplotype Map (HapMap). Using these three databases, we analyzed CVs at the protein function level with bioinformatic programs. We proposed a combinatorial approach using a Support Vector Machine (SVM) to increase the performance of the prediction programs. By cataloging the coding sequence variations in these databases, we found that 4.36% of CVs from HGMD are concurrently registered in dbSNP (8.11% of CVs from dbSNP are concurrent in HGMD). The pattern of substitutions and the functional consequences predicted by three bioinformatic programs differed significantly between concurrent CVs and CVs occurring solely in HGMD or dbSNP. The experimental results showed that the proposed SVM combination noticeably outperformed the individual prediction programs.
This is the first study to compare human sequence variations in HGMD, dbSNP and HapMap at the genome-wide level. We found that a significant proportion of CVs in HGMD and dbSNP overlap, and we emphasize the need to use caution when interpreting the phenotypic relevance of these concurrent CVs. Combining bioinformatic programs can be helpful in predicting the functional consequences of CVs because it improved the performance of functional predictions.
Comprehensive sequence characterization across the MHC is important for successful organ transplantation and genetic association studies. To this end, we have developed an automated sample preparation, molecular barcoding and multiplexing protocol for the amplification and sequence-determination of class I HLA loci. We have coupled this process to a novel HLA calling algorithm to determine the most likely pair of alleles at each locus.
We have benchmarked our protocol with 270 HapMap individuals from four worldwide populations with 96.4% accuracy at 4-digit resolution. A variation of this initial protocol, more suitable for large sample sizes, in which molecular barcodes are added during PCR rather than library construction, was tested on 95 HapMap individuals with 98.6% accuracy at 4-digit resolution.
Next-generation sequencing on the 454 FLX Titanium platform is a reliable, efficient, and scalable technology for HLA typing.
This study reports progress in assembling a DNA barcode reference library for Ephemeroptera, Plecoptera and Trichoptera ("EPTs") from a Canadian subarctic site that is the focus of a comprehensive biodiversity inventory using DNA barcoding. These three groups of aquatic insects exhibit a moderate level of species diversity, making them ideal for testing the feasibility of DNA barcoding for routine biotic surveys. We explore the correlation between morphological species delineations, DNA barcode-based haplotype clusters delimited by a sequence threshold (2%), and a threshold-free approach to biodiversity quantification, phylogenetic diversity.
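The 2% sequence-threshold clustering described above can be sketched as single-linkage grouping on uncorrected pairwise distances between aligned COI sequences. This is an illustrative simplification, and the function names are assumptions, not the study's actual procedure:

```python
def threshold_clusters(seqs, threshold=0.02):
    """Group aligned, equal-length sequences into haplotype clusters.

    Single linkage: a sequence joins a cluster when its uncorrected pairwise
    distance (fraction of differing positions) to ANY member is below the
    threshold; clusters linked through a new sequence are merged.
    """
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)

    clusters = []
    for s in seqs:
        hits = [cl for cl in clusters if any(dist(s, m) < threshold for m in cl)]
        merged = [s]
        for cl in hits:                 # single linkage may merge clusters
            merged.extend(cl)
            clusters.remove(cl)
        clusters.append(merged)
    return clusters
```

Two sequences differing at 1 of 100 positions (1% divergence) fall in one cluster under the 2% threshold, while a sequence 99% divergent forms its own cluster, mirroring how barcode haplotype clusters approximate morphospecies.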
A DNA barcode reference library was built for 112 EPT species from the focal region, consisting of 2277 COI sequences. Close correspondence was found between EPT morphospecies and haplotype clusters designated using a standard threshold value. Similarly, the shapes of taxon accumulation curves based upon haplotype clusters were very similar to phylogenetic diversity accumulation curves, but the cluster-based approach was much more computationally efficient.
The results of this study will facilitate other lines of research on northern EPTs and also bode well for rapidly conducting initial biodiversity assessments in unknown EPT faunas.
Single nucleotide polymorphism (SNP) discovery and genotyping are essential to genetic mapping. There remains a need for a simple, inexpensive platform that allows high-density SNP discovery and genotyping in large populations. Here we describe the sequencing of restriction-site associated DNA (RAD) tags, which identified more than 13,000 SNPs, and mapped three traits in two model organisms, using less than half the capacity of one Illumina sequencing run. We demonstrated that different marker densities can be attained by choice of restriction enzyme. Furthermore, we developed a barcoding system for sample multiplexing and fine mapped the genetic basis of lateral plate armor loss in threespine stickleback by identifying recombinant breakpoints in F2 individuals. Barcoding also facilitated mapping of a second trait, a reduction of pelvic structure, by in silico re-sorting of individuals. To further demonstrate the ease of the RAD sequencing approach we identified polymorphic markers and mapped an induced mutation in Neurospora crassa. Sequencing of RAD markers is an integrated platform for SNP discovery and genotyping. This approach should be widely applicable to genetic mapping in a variety of organisms.
Multiplexing is of vital importance for utilizing the full potential of next-generation sequencing technologies. We here report TagGD (DNA-based Tag Generator and Demultiplexor), a fully customisable, fast and accurate software package that can generate thousands of barcodes satisfying user-defined constraints and can guarantee full demultiplexing accuracy. The barcodes are designed to minimise their interference with the experiment. Insertion, deletion and substitution events are considered when designing and demultiplexing barcodes. 20,000 barcodes of length 18 were designed in 5 minutes, and 2 million barcoded Illumina HiSeq-like reads generated with an error rate of 2% were demultiplexed with full accuracy in 5 minutes. We believe that our software meets a central demand in current high-throughput biology and can be utilised in any field where samples are abundant. The software is available on GitHub (https://github.com/pelinakan/UBD.git).
Mitochondrial disorders can originate from mutations in one of many nuclear genes controlling organelle function or in the mitochondrial genome (mitochondrial DNA (mtDNA)). The large number of potential culprit genes, together with the little guidance offered by most clinical phenotypes as to which gene may be causative, poses a great challenge for the molecular diagnosis of these disorders.
We developed a novel targeted resequencing assay for mitochondrial disorders relying on microarray-based hybrid capture coupled to next-generation sequencing. Specifically, we subjected the entire mtDNA genome and the exons and intron-exon boundary regions of 362 known or candidate causative nuclear genes to targeted capture and resequencing. We here provide proof-of-concept data by testing one HapMap DNA sample and two positive control samples.
Over 94% of the targeted regions were captured and sequenced with appropriate coverage and quality, allowing reliable variant calling. Pathogenic mutations blindly tested in patients' samples were 100% concordant with previous Sanger sequencing results: a known mutation in Pyruvate dehydrogenase alpha 1 subunit (PDHA1), a novel splicing mutation and a known coding mutation in Hydroxyacyl-CoA dehydrogenase alpha subunit (HADHA) were correctly identified. Of the additional variants recognized, 90 to 94% were present in dbSNP while 6 to 10% represented new alterations. The novel nonsynonymous variants were all in the heterozygous state and mostly predicted to be benign. The depth of sequencing coverage of mtDNA was extremely high, suggesting that it may be feasible to detect pathogenic mtDNA mutations confounded by low-level heteroplasmy. Only one sequencing lane of an eight-lane flow cell was utilized for each sample, indicating that a cost-effective clinical test can be achieved.
Our study indicates that the use of next generation sequencing technology holds great promise as a tool for screening mitochondrial disorders. The availability of a comprehensive molecular diagnostic tool will increase the capacity for early and rapid identification of mitochondrial disorders. In addition, the proposed approach has the potential to identify new mutations in candidate genes, expanding and redefining the spectrum of causative genes responsible for mitochondrial disorders.
High-throughput sequencing (HTS) has quickly become a valuable tool for comparative genetics and genomics and is now regularly carried out in laboratories that are not connected to large sequencing centers. Here we describe an updated version of our protocol for constructing single- and paired-end Illumina sequencing libraries, beginning with purified genomic DNA. The present protocol can also be used for “multiplexing,” i.e. the analysis of several samples in a single flowcell lane, by generating “barcoded” or “indexed” Illumina sequencing libraries in a way that is independent of Illumina-supported methods. To analyze sequencing results, we suggest several independent approaches, but end users should be aware that this is a quickly evolving field in which many alignment (“mapping”) and counting algorithms are currently being developed and tested.
High-throughput sequencing; Solexa/Illumina library; Reference genome; Paired-end library; De novo assembly; Barcoding