Targeted genomic enrichment (TGE) is a widely used method for isolating and enriching specific genomic regions prior to massively parallel sequencing. To make effective use of sequencer output, barcoding and sample pooling (multiplexing) after TGE and prior to sequencing (post-capture multiplexing) has become routine. While previous reports have indicated that multiplexing prior to capture (pre-capture multiplexing) is feasible, the effects of this method have not been thoroughly examined in a large number of samples. Here we compare standard post-capture TGE to two levels of pre-capture multiplexing: 12 or 16 samples per pool. We evaluated these methods using standard TGE metrics and determined the ability to identify several classes of genetic mutations in three sets of 96 samples, including 48 controls. Our overall goal was to maximize cost reduction and minimize experimental time while maintaining a high percentage of reads on target and a high depth of coverage at thresholds required for variant detection.
We adapted the standard post-capture TGE method for pre-capture TGE with several protocol modifications, including redesign of blocking oligonucleotides and optimization of enzymatic and amplification steps. Pre-capture multiplexing reduced costs for TGE by at least 38% and significantly reduced hands-on time during the TGE protocol. We found that pre-capture multiplexing reduced capture efficiency by 23% or 31% for pre-capture pools of 12 and 16 samples, respectively. However, efficiency losses at this step can be compensated for by reducing the number of simultaneously sequenced samples. Pre-capture multiplexing and post-capture TGE performed similarly with respect to variant detection of positive control mutations. In addition, we detected no instances of sample switching due to aberrant barcode identification.
Pre-capture multiplexing improves efficiency of TGE experiments with respect to hands-on time and reagent use compared to standard post-capture TGE. A decrease in capture efficiency is observed when using pre-capture multiplexing; however, it does not negatively impact variant detection and can be accommodated by the experimental design.
Massively parallel sequencing; Next-generation sequencing; Genomics; Targeted genomic enrichment; Sequence capture; Pre-capture multiplexing; Post-capture multiplexing; Indexing
We have developed an integrated strategy for targeted resequencing and analysis of gene subsets from the human exome for variants. Our capture technology is geared towards resequencing gene subsets substantially larger than can be done efficiently with simplex or multiplex PCR but smaller in scale than exome sequencing. We describe all the steps from the initial capture assay to single nucleotide variant (SNV) discovery. The capture methodology uses in-solution 80-mer oligonucleotides. To provide optimal flexibility in choosing human gene targets, we designed an in silico set of oligonucleotides, the Human OligoExome, that covers the gene exons annotated by the Consensus Coding Sequence (CCDS) project. This resource is openly available as an Internet accessible database where one can download capture oligonucleotide sequences for any CCDS gene and design custom capture assays. Using this resource, we demonstrated the flexibility of this assay by custom designing capture assays ranging from 10 to over 100 gene targets with total capture sizes from over 100 kilobases to nearly one megabase. We established a method to reduce capture variability and incorporated indexing schemes to increase sample throughput. Our approach has multiple applications that include but are not limited to population targeted resequencing studies of specific gene subsets, validation of variants discovered in whole genome sequencing surveys and possible diagnostic analysis of disease gene subsets. We also present a cost analysis demonstrating its cost-effectiveness for large population studies.
Despite the ever-increasing throughput and steadily decreasing cost of next-generation sequencing (NGS), whole genome sequencing of humans is still not a viable option for the majority of genetics laboratories. This is particularly true in the case of complex disease studies, where large sample sets are often required to achieve adequate statistical power. To fully leverage the potential of NGS technology on large sample sets, several methods have been developed to selectively enrich for regions of interest. Enrichment reduces both monetary and computational costs compared to whole genome sequencing, while allowing researchers to take advantage of NGS throughput. Several targeted enrichment approaches are currently available, including molecular inversion probe ligation sequencing (MIPS), oligonucleotide hybridization based approaches, and PCR-based strategies. To assess how these methods performed when used in conjunction with the ABI SOLiD 3+, we investigated three enrichment techniques: NimbleGen oligonucleotide hybridization array-based capture; Agilent SureSelect oligonucleotide hybridization solution-based capture; and RainDance Technologies' multiplexed PCR-based approach. Target regions were selected from exons and evolutionarily conserved areas throughout the human genome. Probe and primer pair design was carried out for all three methods using their respective informatics pipelines. In all, approximately 0.8 Mb of target space was identical for all three methods. SOLiD sequencing results were analyzed for several metrics, including consistency of coverage depth across samples, on-target versus off-target efficiency, allelic bias, and genotype concordance with array-based genotyping data. Agilent SureSelect exhibited superior on-target efficiency and correlation of read depths across samples. NimbleGen performance was similar at read depths of 20× and below. Both RainDance and NimbleGen SeqCap exhibited tighter distributions of read depth around the mean, but both suffered from lower on-target efficiency in our experiments. RainDance demonstrated the highest versatility in assay design.
Next-generation DNA sequencing has revolutionized the discovery of rare polymorphisms, structural variants, and novel transcripts. To meet the demand for fast, cost-effective, and accurate genome analysis methods from small scale studies to large sample cohorts, Agilent Technologies has developed the SureSelect™ Target Enrichment System. Available for the Illumina, SOLiD, and 454 sequencing platforms, SureSelect is a highly robust, customizable, and scalable system that focuses analyses on specific genomic loci by in-solution hybrid capture. In addition, Agilent has introduced SureSelect XT for Illumina and SOLiD, which combines gDNA prep, library prep, and SureSelect Target Enrichment reagents in one complete kit. Both SureSelect and SureSelect XT demonstrate high performance, as measured by capture efficiency, uniformity, reproducibility, and SNP detection. We highlight the utility of the SureSelect system across a wide range of target sizes and genome complexity using pre-designed catalog libraries targeting cancer gene sets, sequences encoding the kinome, and both human and mouse All Exon content. In addition, user-defined custom content can be easily developed using the Agilent eArray software with candidate variant coordinates as input. User-defined content can be manufactured on-demand as a custom SureSelect kit, or combined with pre-defined Agilent catalog content using the Plus option. We propose a novel approach for variant discovery: using SureSelect catalog designs to uncover candidate variants, followed by the design of smaller focused custom libraries for SNP validation and region profiling. By pooling many samples together per lane or slide, SureSelect multiplexing kits for Illumina and SOLiD enable validation across large sample cohorts with substantial cost savings. Accurate post-target-enrichment pooling is facilitated by the Agilent Bioanalyzer and QPCR NGS Library Quantification kits, which ensure equal representation across samples.
Further efficiencies are realized using the Bravo Automated Liquid Handling Platform to meet the need for parallel preparation of multiplexed libraries.
As DNA sequencing technology has markedly advanced in recent years [2], it has become increasingly evident that the amount of genetic variation between any two individuals is greater than previously thought [3]. In contrast, array-based genotyping has failed to identify a significant contribution of common sequence variants to the phenotypic variability of common disease [4,5]. Taken together, these observations have led to the evolution of the Common Disease / Rare Variant hypothesis suggesting that the majority of the "missing heritability" in common and complex phenotypes is instead due to an individual's personal profile of rare or private DNA variants [6-8]. However, characterizing how rare variation impacts complex phenotypes requires the analysis of many affected individuals at many genomic loci, and is ideally compared to a similar survey in an unaffected cohort. Despite the sequencing power offered by today's platforms, a population-based survey of many genomic loci and the subsequent computational analysis required remains prohibitive for many investigators.
To address this need, we have developed a pooled sequencing approach [1,9] and a novel software package [1] for highly accurate rare variant detection from the resulting data. The ability to pool genomes from entire populations of affected individuals and survey the degree of genetic variation at multiple targeted regions in a single sequencing library provides excellent cost and time savings over traditional single-sample sequencing methodology. With a mean sequencing coverage per allele of 25-fold, our custom algorithm, SPLINTER, uses an internal variant calling control strategy to call insertions, deletions and substitutions up to four base pairs in length with high sensitivity and specificity from pools of up to 1 mutant allele in 500 individuals. Here we describe the method for preparing the pooled sequencing library, followed by step-by-step instructions on how to use the SPLINTER package for pooled sequencing analysis (http://www.ibridgenetwork.org/wustl/splinter). We show a comparison between pooled sequencing of 947 individuals, all of whom also underwent genome-wide array genotyping, at over 20 kb of sequencing per person. Concordance between array genotypes and both tagged and novel variants called in the pooled sample was excellent. This method can be easily scaled up to any number of genomic loci and any number of individuals. By incorporating the internal positive and negative amplicon controls at ratios that mimic the population under study, the algorithm can be calibrated for optimal performance. This strategy can also be modified for use with hybridization capture or individual-specific barcodes and can be applied to the sequencing of naturally heterogeneous samples, such as tumor DNA.
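The detection limit quoted above, one mutant allele among 500 individuals at roughly 25-fold coverage per allele, reduces to simple pool arithmetic. A minimal sketch, assuming a diploid pool and perfectly uniform coverage (an idealization; real pools show coverage dispersion):

```python
# Expected read support for a single mutant allele in a pooled library.
# Illustrative assumptions: diploid individuals, uniform per-allele coverage.

def singleton_support(n_individuals, coverage_per_allele):
    """Return (allele_fraction, expected_variant_reads, total_reads)
    for one mutant allele in a pool of n diploid individuals."""
    n_alleles = 2 * n_individuals                 # diploid pool
    total_reads = n_alleles * coverage_per_allele
    allele_fraction = 1 / n_alleles               # e.g. 1/1000 for 500 people
    expected_variant_reads = coverage_per_allele  # one allele at that coverage
    return allele_fraction, expected_variant_reads, total_reads

frac, support, total = singleton_support(500, 25)
print(frac, support, total)  # → 0.001 25 25000
```

A true singleton is therefore expected in about 25 of 25,000 reads at a locus, which is why spiking in positive and negative control amplicons at matched ratios lets the caller set a detection threshold above the sequencing error floor.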
Genetics; Issue 64; Genomics; Cancer Biology; Bioinformatics; Pooled DNA sequencing; SPLINTER; rare genetic variants; genetic screening; phenotype; high throughput; computational analysis; DNA; PCR; primers
Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource (http://oligogenome.stanford.edu/). This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.
Computer programs for the generation of multiple sequence alignments such as "Clustal W" allow detection of regions that are most conserved among many sequence variants. However, even for regions that are equally conserved, their potential utility as hybridization targets varies. Mismatches in sequence variants are more disruptive in some duplexes than in others. Additionally, the propensity for self-interactions amongst oligonucleotides targeting conserved regions differs and the structure of target regions themselves can also influence hybridization efficiency. There is a need to develop software that will employ thermodynamic selection criteria for finding optimal hybridization targets in related sequences.
A new scheme and new software for optimal detection of oligonucleotide hybridization targets common to families of aligned sequences is suggested and applied to aligned sequence variants of the complete HIV-1 genome. The scheme employs sequential filtering procedures with experimentally determined thermodynamic cut-off points: 1) creation of a consensus sequence of RNA or DNA from aligned sequence variants, with specification of the lengths of fragments to be used as oligonucleotide targets in the analyses; 2) selection of DNA oligonucleotides that have pairing potential, greater than a defined threshold, with all variants of aligned RNA sequences; 3) elimination of DNA oligonucleotides that have self-pairing potentials for intra- and inter-molecular interactions greater than defined thresholds. Applying this scheme to the HIV-1 genome identified theoretically optimal RNA target regions for consensus oligonucleotides. These can be further used to improve oligo-probe-based HIV detection techniques.
A selection scheme with thermodynamic thresholds and software is presented in this study. The package can be used for any purpose where there is a need to design optimal consensus oligonucleotides capable of interacting efficiently with hybridization targets common to families of aligned RNA or DNA sequences. Our thermodynamic approach can be helpful in designing consensus oligonucleotides with consistently high affinity to target variants in evolutionary related genes or genomes.
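The three-step filter can be sketched in code. This is an illustrative simplification: sequence identity to the consensus stands in for the experimentally determined thermodynamic pairing potential, a fold-back complementarity score stands in for the self-pairing free energy, a DNA alphabet is used throughout, and all function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def consensus(aligned_seqs):
    """Step 1: majority-rule consensus of equal-length aligned variants."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned_seqs))

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def self_pairing(oligo):
    """Crude hairpin/dimer proxy: fraction of positions complementary to
    the opposite position when the oligo is folded back on itself."""
    n = len(oligo)
    return sum(COMPLEMENT[oligo[i]] == oligo[n - 1 - i] for i in range(n)) / n

def select_targets(aligned_seqs, length, pair_min=0.9, self_max=0.5):
    """Return (position, oligo) pairs passing the three sequential filters."""
    cons = consensus(aligned_seqs)
    targets = []
    for i in range(len(cons) - length + 1):
        window = cons[i:i + length]
        # Step 2: the probe must pair with every variant; identity of each
        # variant window to the consensus window stands in for pairing energy.
        if not all(identity(window, s[i:i + length]) >= pair_min
                   for s in aligned_seqs):
            continue
        oligo = revcomp(window)
        # Step 3: discard oligos prone to self-pairing interactions.
        if self_pairing(oligo) > self_max:
            continue
        targets.append((i, oligo))
    return targets
```

In the published scheme the two thresholds are free-energy cut-offs calibrated experimentally; the fractional scores here only mimic that sequential-filter structure.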
Wolbachia endosymbionts are widespread in arthropods and are generally considered reproductive parasites, inducing various phenotypes including cytoplasmic incompatibility, parthenogenesis, feminization and male killing, which serve to promote their spread through populations. In contrast, Wolbachia infecting filarial nematodes that cause human diseases, including elephantiasis and river blindness, are obligate mutualists. Purifying DNA from these unculturable bacteria for efficient genomic sequencing has proven difficult with a variety of techniques. To efficiently capture endosymbiont DNA for studies that examine the biology of symbiosis, we devised a parallel strategy to an earlier array-based method by creating a set of SureSelect™ (Agilent) 120-mer target enrichment RNA oligonucleotides (“baits”) for solution hybrid selection. These were designed from Wolbachia complete and partial genome sequences in GenBank and were tiled across each genomic sequence with 60 bp overlap. Baits were filtered for homology against host genomes containing Wolbachia using BLAT, and sequences with significant host homology were removed from the bait pool. Filarial parasite Brugia malayi DNA was used as a test case, as the complete sequences of both Wolbachia and its host are known. DNA eluted from capture was size selected and sequencing samples were prepared using the NEBNext® Sample Preparation Kit. One-third of a 50 nt paired-end sequencing lane on the HiSeq™ 2000 (Illumina) yielded 53 million reads, and the entirety of the Wolbachia genome was captured. We then used the baits to isolate more than 97.1% of the genome of a distantly related Wolbachia strain from the crustacean Armadillidium vulgare, demonstrating that the method can be used to enrich target DNA from unculturable microbes over large evolutionary distances.
Wolbachia; Obligate endosymbiont; Target enrichment; NextGen sequencing; DNA capture; SureSelect™
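The bait layout described above, 120-mers tiled with 60 bp overlap (a 60 bp step giving roughly 2× coverage of every base), can be sketched as follows. `tile_baits` is a hypothetical helper for illustration only; the published workflow additionally BLAT-filtered baits against host genomes, which is omitted here.

```python
def tile_baits(seq, bait_len=120, overlap=60):
    """Tile a target sequence with baits of bait_len, stepping by
    bait_len - overlap so each base is covered ~2x.
    Returns a list of (start, bait_sequence) tuples."""
    step = bait_len - overlap
    baits = []
    pos = 0
    while pos + bait_len <= len(seq):
        baits.append((pos, seq[pos:pos + bait_len]))
        pos += step
    # Add a final bait flush with the 3' end if the tail is not yet covered.
    if len(seq) >= bait_len and baits[-1][0] != len(seq) - bait_len:
        baits.append((len(seq) - bait_len, seq[-bait_len:]))
    return baits
```

For a 300 bp target this yields baits starting at 0, 60, 120, and 180; the flush-end rule only fires when the target length is not a multiple of the step.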
Techniques enabling targeted re-sequencing of the protein coding sequences of the human genome on next generation sequencing instruments are of great interest. We conducted a systematic comparison of the solution-based exome capture kits provided by Agilent and Roche NimbleGen. A control DNA sample was captured with all four capture methods (two kit versions from each vendor) and prepared for Illumina GAII sequencing. Sequence data from additional samples prepared with the same protocols were also used in the comparison.
We developed a bioinformatics pipeline for quality control, short read alignment, variant identification and annotation of the sequence data. In our analysis, a larger percentage of the high quality reads from the NimbleGen captures than from the Agilent captures aligned to the capture target regions. High GC content of the target sequence was associated with poor capture success in all exome enrichment methods. Comparison of mean allele balances for heterozygous variants indicated a tendency to have more reference bases than variant bases in the heterozygous variant positions within the target regions in all methods. There was virtually no difference between the methods in genotype concordance with genotypes derived from SNP arrays. A minimum of 11× coverage was required to make a heterozygote genotype call with 99% accuracy when compared to common SNPs on genome-wide association arrays.
Libraries captured with NimbleGen kits aligned more accurately to the target regions. The updated NimbleGen kit most efficiently covered the exome with a minimum coverage of 20×, yet none of the kits captured all the Consensus Coding Sequence annotated exons.
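The 11× figure above is consistent with a toy binomial model (emphatically not the study's actual genotype caller): if a heterozygote call requires at least two reads supporting each allele, and alternate-allele reads at a true het site follow Binomial(depth, 0.5), the detection probability is about 0.99 at 11× but only about 0.78 at 6×.

```python
from math import comb

def p_het_detect(depth, min_each=2, p=0.5):
    """P(each allele observed >= min_each times) at a true heterozygous
    site, modeling the alt-read count as Binomial(depth, p).
    Toy model for intuition, not a production genotype caller."""
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(min_each, depth - min_each + 1))

for depth in (6, 11, 20):
    print(depth, round(p_het_detect(depth), 4))
```

Real callers weigh base and mapping qualities as well, so the empirical threshold lands near, but not exactly at, what this idealized calculation predicts.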
Identification of gene variants plays an important role in research on and diagnosis of genetic diseases. A combination of enrichment of targeted genes and next-generation sequencing (targeted DNA-HiSeq) results in both high efficiency and low cost for targeted sequencing of genes of interest.
To identify mutations associated with genetic diseases, we designed an array-based gene chip to capture all of the exons of 193 genes involved in 103 genetic diseases. To evaluate this technology, we selected seven samples from seven patients with six different genetic diseases resulting from six disease-causing genes and 100 samples from normal human adults as controls. The data obtained showed that, on average, 99.14% of the 3,382 exons were successfully detected with more than 30-fold coverage using targeted DNA-HiSeq technology, and we found six known variants in four disease-causing genes and two novel mutations in two other disease-causing genes (the STS gene for XLI and the FBN1 gene for MFS) as well as one exon deletion mutation in the DMD gene. These results were confirmed in their entirety using either the Sanger sequencing method or real-time PCR.
Targeted DNA-HiSeq combines next-generation sequencing with the capture of sequences from a relevant subset of high-interest genes. This method was tested by capturing sequences from a DNA library through hybridization to oligonucleotide probes specific for genetic disorder-related genes and was found to show high selectivity, improve the detection of mutations, enable the discovery of novel variants, and provide additional indel data. Thus, targeted DNA-HiSeq can be used to analyze the gene variant profiles of monogenic diseases with high sensitivity, fidelity, throughput and speed.
Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing of many individuals across many genomic regions and comparing against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost prohibitive for most investigators. In addition, DNA samples are often precious and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost and time-efficient strategy for surveying populations of individuals for rare variants. We set out to 1) create a scalable, multiplexing method for custom capture with or without individual DNA indexing that was amenable to low amounts of input DNA and 2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner, Novoalign.
We report methodology for pooled hybridization capture with pre-enrichment, indexed multiplexing of up to 48 individuals or non-indexed pooled sequencing of up to 92 individuals with as little as 70 ng of DNA per person. Modified solid phase reversible immobilization bead purification strategies eliminate sample transfers from sonication in 96-well plates through adapter ligation, resulting in 50% less library preparation reagent consumption. Custom Y-shaped adapters containing novel 7 base pair index sequences with a Hamming distance of ≥2 were ligated directly onto fragmented source DNA, eliminating the need for PCR to incorporate indexes; this was followed by a custom blocking strategy using a single oligonucleotide regardless of index sequence. These results were obtained by aligning raw reads against the entire genome using Novoalign, followed by variant calling of non-indexed pools using SPLINTER or of indexed samples using SAMtools. With these pipelines, we find sensitivity and specificity of 99.4% and 99.7% for pooled exome sequencing. Sensitivity, and to a lesser degree specificity, proved to be a function of coverage. For rare variants (≤2% minor allele frequency), we achieved sensitivity and specificity of ≥94.9% and ≥99.99% for custom capture of 2.5 Mb in multiplexed libraries of 22–48 individuals with only ≥5-fold coverage/chromosome, but these parameters improved to ≥98.7% and 100% with 20-fold coverage/chromosome.
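The index-design constraint above, 7 bp barcodes with pairwise Hamming distance ≥2 so that no single sequencing error can convert one index into another, can be reproduced with a simple greedy search. This is a hypothetical sketch, not the published index set; a production design would also screen candidates for GC balance, homopolymers, and ligation bias.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_indexes(length=7, min_dist=2, n_wanted=48):
    """Greedily accumulate index sequences whose pairwise Hamming
    distance is at least min_dist."""
    chosen = []
    for combo in product("ACGT", repeat=length):
        seq = "".join(combo)
        if all(hamming(seq, c) >= min_dist for c in chosen):
            chosen.append(seq)
            if len(chosen) == n_wanted:
                break
    return chosen

indexes = pick_indexes()
```

With distance ≥2, an index read carrying a single error matches no valid barcode exactly, so the read can be discarded or flagged instead of being silently assigned to the wrong sample.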
This highly scalable methodology enables accurate rare variant detection, with or without individual DNA sample indexing, while reducing the amount of required source DNA and total costs through less hybridization reagent consumption, multi-sample sonication in a standard PCR plate, multiplexed pre-enrichment pooling with a single hybridization, and the lower sequencing coverage required to obtain high sensitivity.
Rare variants; Genomics; Exome; Hybridization capture; Multiplexed capture; Indexed capture; SPLINTER
Antibody (Ab) discovery research has accelerated as monoclonal Ab (mAb)-based biologic strategies have proved efficacious in the treatment of many human diseases, ranging from cancer to autoimmunity. Initial steps in the discovery of therapeutic mAb require epitope characterization and preclinical studies in vitro and in animal models often using limited quantities of Ab. To facilitate this research, our Shared Resource Laboratory (SRL) offers microscale Ab conjugation. Ab submitted for conjugation may or may not be commercially produced, but have not been characterized for use in immunofluorescence applications. Purified mAb and even polyclonal Ab (pAb) can be efficiently conjugated, although the advantages of direct conjugation are more obvious for mAb. To improve consistency of results in microscale (<100 µg) conjugation reactions, we chose to utilize several different varieties of commercial kits. Kits tested were limited to covalent fluorophore labeling. Established quality control (QC) processes to validate fluorophore labeling either rely solely on spectrophotometry or utilize flow cytometry of cells expected to express the target antigen. This methodology is not compatible with microscale reactions using uncharacterized Ab. We developed a novel method for cell-free QC of our conjugates that reflects conjugation quality, but is independent of the biological properties of the Ab itself. QC is critical, as amine reactive chemistry relies on the absence of even trace quantities of competing amine moieties such as those found in the Good buffers (HEPES, MOPS, TES, etc.) or irrelevant proteins. Herein, we present data used to validate our method of assessing the extent of labeling and the removal of free dye by using flow cytometric analysis of polystyrene Ab capture beads to verify product quality.
This microscale custom conjugation and QC allows for the rapid development and validation of high-quality reagents, specific to the needs of our colleagues and clientele.
Next generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis. Most analysis pipelines do not connect genomic variants to disease and protein specific information during the initial filtering and selection of relevant variants. Robust bioinformatics pipelines were implemented for trimming, genome alignment, SNP, INDEL, or structural variation detection of whole genome or exon-capture sequencing data from Illumina. Quality control metrics were analyzed at each step of the pipeline to ensure data integrity for clinical applications. We further annotate the variants with statistics regarding the diseased population and variant impact. Custom algorithms were developed to analyze the variant data by filtering variants based upon criteria such as quality of variant, inheritance pattern (e.g. dominant, recessive, X-linked), and impact of variant. The resulting variants and their associated genes are linked to the Integrative Genomics Viewer (IGV) in a genome context, and to the PIR iProXpress system for rich protein and disease information. This poster will present detailed analysis of whole exome sequencing performed on patients with facio-skeletal anomalies. We will compare and contrast data analysis methods and report on potential clinically relevant leads discovered by implementing our new clinical variant pipeline. Our variant analysis of these patients and their unaffected family members resulted in more than 500,000 variants.
By applying our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for further filtering of disease relevant variants that impact protein coding genes. Taken together, the integrative approach allows better selection of disease relevant genomic variants by using both genomic and disease/protein centric information. This type of clustering approach can help clinicians better understand the association of variants to the disease phenotype, enabling application to personalized medicine approaches.
We have developed a targeted resequencing approach referred to as Oligonucleotide-Selective Sequencing. In this study, we report a series of significant improvements and novel applications of this method, whereby the surface of a sequencing flow cell is modified in situ to capture specific genomic regions of interest from a sample, which are then sequenced. These improvements include a fully automated targeted sequencing platform through the use of a standard Illumina cBot fluidics station. Targeting optimization increased the yield of total on-target sequencing data 2-fold compared to the previous iteration, while simultaneously increasing the percentage of reads that could be mapped to the human genome. The described assays cover up to 1421 genes with a total coverage of 5.5 Megabases (Mb). We demonstrate abundance uniformity, with greater than 90% of targets within one log (10-fold) of the median, and a targeting rate of up to 95%. We also sequenced continuous genomic loci up to 1.5 Mb while simultaneously genotyping SNPs and genes. Variants with low minor allele fraction were sensitively detected at levels of 5%. Finally, we determined the exact breakpoint sequence of cancer rearrangements. Overall, this approach has high performance for selective sequencing of genome targets, configuration flexibility and variant calling accuracy.
Next-generation sequencing requires specialized and often time-consuming methods to select particular nucleic acid fractions and generate libraries suitable for sequencing. The complexity and time requirements of these methods make automation highly desirable, particularly as sequencing becomes more common and higher throughput. We present here new, automatable methods to deplete ribosomal RNA (rRNA) from a total RNA sample for subsequent sequencing and efficient, high-yield library construction. It is desirable to remove rRNA for RNA-seq, since it comprises 85–95% of total RNA, occupies valuable sequencing capacity, and results in a low signal-to-noise ratio that can make detection and analysis of the RNA species of interest difficult. Our method (the GeneRead rRNA Depletion Kit) effectively removes rRNA, while ensuring complete recovery of mRNA and noncoding RNA from various species, including human, mouse, and rat. The method involves specific oligonucleotide probes, designed to hybridize to the large (18S, 28S), small (5S, 5.8S), and mitochondrial (12S, 16S) rRNAs. The rRNA:DNA hybrid is recognized by a hybrid-specific antibody that can be captured on a bead and removed from the sample, depleting the rRNA. This antibody-mediated capture provides a higher level of specificity of rRNA depletion than other methods, works well with fragmented samples, and preserves noncoding RNA. The method can be performed manually or automated on the QIAcube from hybridization onward, including subsequent RNA cleanup. Kit performance was tested using qRT-PCR and sequencing. Comparison with other rRNA depletion techniques revealed that the GeneRead rRNA Depletion Kit effectively eliminates rRNA, while better preserving the natural representation of other RNAs. This method improves the ratio of useful data, decreases bias, and preserves noncoding RNA, providing high-quality RNA highly suited for next-generation sequencing applications.
For the Ion Torrent and Illumina platforms, we have developed methods that simplify the library construction process, leading to higher yields and time savings. We have integrated a single-tube protocol for library fragment trimming and adapter ligation, followed by library purification and adapter–dimer depletion into one straightforward workflow. This enables construction of high-quality libraries from as little as 50 ng of nucleic acid and allows the process to be automated on the QIAcube. Multiple libraries prepared from one sample using the automated procedure on the QIAcube show a very high consistency of the resulting libraries with comparably high yields. The libraries generated also have the full-length library adapters, enabling the preparations obtained to be used directly for sequencing. For the optional library amplification step, a newly developed high-fidelity DNA polymerase can be used that minimizes the amplification-induced sequence biases in AT- and GC-rich regions.
Targeted sequence capture is a promising technology in many areas in biology. These methods enable efficient and relatively inexpensive sequencing of hundreds to thousands of genes or genomic regions from many more individuals than is practical using whole-genome sequencing approaches. Here, we demonstrate the feasibility of target enrichment using sequence capture in polyploid cotton. To capture and sequence both members of each gene pair (homeologs) of wild and domesticated Gossypium hirsutum, we created custom hybridization probes to target 1000 genes (500 pairs of homeologs) using information from the cotton transcriptome. Two widely divergent samples of G. hirsutum were hybridized to four custom NimbleGen capture arrays containing probes for targeted genes. We show that the two coresident homeologs in the allopolyploid nucleus were efficiently captured with high coverage. The capture efficiency was similar between the two accessions and independent of whether the samples were multiplexed. A significant amount of flanking, nontargeted sequence (untranslated regions and introns) was also captured and sequenced along with the targeted exons. Intraindividual heterozygosity is low in both wild and cultivated Upland cotton, as expected from the high level of inbreeding in natural G. hirsutum and bottlenecks accompanying domestication. In addition, levels of heterozygosity appeared asymmetrical with respect to genome (AT or DT) in cultivated cotton. The approach used here is general, scalable, and may be adapted for many different research inquiries involving polyploid plant genomes.
Gossypium; allopolyploidy; homoeologs; sequence capture; next-generation sequencing
Target enrichment and resequencing is a widely used approach for identification of cancer genes and genetic variants associated with diseases. Although cost-effective compared to whole genome sequencing, analysis of many samples still constitutes a significant cost, which could be reduced by pooling samples before capture. Another factor that often limits the number of cancer samples that can be analyzed is the amount of available tumor DNA. We evaluated the performance of whole genome amplified DNA and the power to detect subclonal somatic single nucleotide variants in non-indexed pools of cancer samples using the HaloPlex technology for target enrichment and next generation sequencing.
Using the HaloPlex technology, we captured a set of 1528 putative somatic single nucleotide variants and germline SNPs that had been identified by whole genome sequencing, and sequenced them to a depth of 792–1752×. We found that the allele fractions of the analyzed variants are well preserved during whole genome amplification and that neither capture specificity nor variant calling is affected. We detected a large majority of the known single nucleotide variants present uniquely in one sample with allele fractions as low as 0.1 in non-indexed pools of up to ten samples. We also identified and experimentally validated six novel variants in the samples included in the pools.
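The pooled-detection result above can be sketched with a simple binomial model: at the reported sequencing depths, a pool-unique variant at allele fraction 0.1 contributes enough alternate-allele reads to be sampled essentially every time. The minimum-read calling threshold and the shallow-depth comparison below are illustrative assumptions, not the study's actual calling criteria.

```python
from math import comb

def detection_power(depth: int, allele_fraction: float, min_alt_reads: int = 4) -> float:
    """P(observing >= min_alt_reads variant reads) under Binomial(depth, allele_fraction)."""
    p_below = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below

# At ~800x, a pool-unique variant at allele fraction 0.1 is almost surely sampled;
# at a conventional 30x, the same variant would often be missed.
print(f"800x: {detection_power(800, 0.10):.4f}  30x: {detection_power(30, 0.10):.4f}")
```

The comparison makes plain why deep sequencing (hundreds-fold coverage) is the enabling factor for non-indexed pooling: the dilution of each sample's variant is compensated by depth.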
Our work demonstrates that whole genome amplified DNA performs as well as genomic DNA for target enrichment and that accurate variant detection is possible in non-indexed pools of cancer samples. These findings show that analysis of a large number of samples is feasible at low cost, even when only small amounts of DNA are available, thereby significantly increasing the chances of identifying recurrent mutations in cancer samples.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-14-856) contains supplementary material, which is available to authorized users.
Target enrichment; HaloPlex; Non-indexed pooling; Whole genome amplification; Single nucleotide variant; Deep sequencing
Non-human primates have emerged as an important resource for the study of human disease and evolution. The characterization of genomic variation between and within non-human primate species could advance the development of genetically defined non-human primate disease models. However, non-human primate-specific reagents that would expedite such research, such as exon-capture tools, are lacking. We evaluated the efficiency of using a human exome capture design for the selective enrichment of exonic regions of non-human primates. We compared the exon sequence recovery in nine chimpanzees, two crab-eating macaques and eight Japanese macaques. Over 91% of the target regions were captured in the non-human primate samples, although the specificity of the capture decreased as evolutionary divergence from humans increased. Both intra-specific and inter-specific DNA variants were identified; Sanger-based resequencing validated 85.4% of 41 randomly selected SNPs. Among the short indels identified, a majority (54.6%–77.3%) of the variants resulted in a change of 3 base pairs, consistent with selection against frameshift mutations. Taken together, these findings indicate that a human-designed exon-capture array can provide efficient enrichment of non-human primate gene regions. Accordingly, human exon-capture methods provide an attractive, cost-effective approach for the comparative analysis of non-human primate genomes, including gene-based DNA variant discovery.
Mitochondrial disorders can originate from mutations in one of many nuclear genes controlling the organelle function or in the mitochondrial genome (mitochondrial DNA (mtDNA)). The large numbers of potential culprit genes, together with the little guidance offered by most clinical phenotypes as to which gene may be causative, are a great challenge for the molecular diagnosis of these disorders.
We developed a novel targeted resequencing assay for mitochondrial disorders relying on microarray-based hybrid capture coupled to next-generation sequencing. Specifically, we subjected the entire mtDNA genome and the exons and intron-exon boundary regions of 362 known or candidate causative nuclear genes to targeted capture and resequencing. We here provide proof-of-concept data by testing one HapMap DNA sample and two positive control samples.
Over 94% of the targeted regions were captured and sequenced with appropriate coverage and quality, allowing reliable variant calling. Pathogenic mutations blindly tested in patients' samples were 100% concordant with previous Sanger sequencing results: a known mutation in Pyruvate dehydrogenase alpha 1 subunit (PDHA1), as well as a novel splicing mutation and a known coding mutation in Hydroxyacyl-CoA dehydrogenase alpha subunit (HADHA), were correctly identified. Of the additional variants recognized, 90 to 94% were present in dbSNP while 6 to 10% represented new alterations. The novel nonsynonymous variants were all in the heterozygous state and mostly predicted to be benign. The depth of sequencing coverage of mtDNA was extremely high, suggesting that it may be feasible to detect pathogenic mtDNA mutations confounded by low-level heteroplasmy. Only one sequencing lane of an eight-lane flow cell was utilized for each sample, indicating that a cost-effective clinical test can be achieved.
Our study indicates that the use of next generation sequencing technology holds great promise as a tool for screening mitochondrial disorders. The availability of a comprehensive molecular diagnostic tool will increase the capacity for early and rapid identification of mitochondrial disorders. In addition, the proposed approach has the potential to identify new mutations in candidate genes, expanding and redefining the spectrum of causative genes responsible for mitochondrial disorders.
Complementary techniques that deepen information content and minimize reagent costs are required to realize the full potential of massively parallel sequencing. Here, we describe a resequencing approach that directs focus to genomic regions of high interest by combining hybridization-based purification of multi-megabase regions with sequencing on the Illumina Genome Analyzer (GA). The capture matrix is created by a microarray on which probes can be programmed as desired to target any non-repeat portion of the genome, while the method requires only a basic familiarity with microarray hybridization. We present a detailed protocol suitable for 1–2 µg of input genomic DNA and highlight key design tips in which high specificity (>65% of reads stem from enriched exons) and high sensitivity (98% targeted base pair coverage) can be achieved. We have successfully applied this to the enrichment of coding regions, in both human and mouse, ranging from 0.5 to 4 Mb in length. From genomic DNA library production to base-called sequences, this procedure takes approximately 9–10 d inclusive of array captures and one Illumina flow cell run.
Current target enrichment systems for large-scale next-generation sequencing typically require synthetic oligonucleotides used as capture reagents to isolate sequences of interest. The majority of target enrichment reagents are focused on gene coding regions or promoters en masse. Here we describe the development of a customizable targeted capture system using biotinylated RNA probe baits transcribed from sheared bacterial artificial chromosome clone templates that enables capture of large, contiguous blocks of the genome for sequencing applications. This clone adapted template capture hybridization sequencing (CATCH-Seq) procedure can be used to capture both coding and non-coding regions of a gene, and resolve the boundaries of copy number variations within a genomic target site. Furthermore, libraries constructed with methylated adapters prior to solution hybridization also enable targeted bisulfite sequencing. We applied CATCH-Seq to diverse targets ranging in size from 125 kb to 3.5 Mb. Our approach provides a simple and cost effective alternative to other capture platforms because of template-based, enzymatic probe synthesis and the lack of oligonucleotide design costs. Given its similarity in procedure, CATCH-Seq can also be performed in parallel with commercial systems.
The goal of human genome re-sequencing is to obtain an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium- and short-read sequencing platforms such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider how to optimally integrate technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs.
Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.
In recent years, the development of high throughput sequencing and array technologies has enabled the accurate re-sequencing of individual genomes, especially in identifying and reconstructing the variants in an individual's genome compared to a “reference”. The costs and sensitivities of these technologies differ considerably from each other, and even more technologies are expected to appear in the near future. To both reduce the total cost of re-sequencing to an affordable point and be adaptive to these constantly evolving bio-technologies, we propose to build a computationally efficient simulation framework that can help us optimize the combination of different technologies to perform low cost comparative genome re-sequencing, especially in reconstructing large structural variants, which is considered in many respects the most challenging step in genome re-sequencing. Our simulation results quantitatively show how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways. We envision that in the future, more experimental technologies will be incorporated into this simulation framework and its results can provide informative guidelines for the actual experimental design to achieve optimal genome re-sequencing output at low costs.
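The kind of optimization this simulation framework performs can be caricatured as a tiny brute-force search over technology mixes. Every number below (per-coverage costs, the "sv" and "snp" resolving scores, and both targets) is invented for illustration only; the actual framework models mapability and realistic assembly algorithms.

```python
from itertools import product

# Per 1x of coverage: cost and a crude "resolving power" for SVs and SNPs.
# All values are invented for this sketch, not taken from the paper.
TECH = {
    "long":  {"cost": 3.0, "sv": 2.5, "snp": 1.0},
    "short": {"cost": 1.0, "sv": 0.5, "snp": 1.2},
}

def cheapest_mix(sv_target: float = 20.0, snp_target: float = 24.0, max_x: int = 60):
    """Brute-force the cheapest (long, short) coverage mix meeting both targets."""
    best = None
    for lx, sx in product(range(max_x + 1), repeat=2):
        sv = lx * TECH["long"]["sv"] + sx * TECH["short"]["sv"]
        snp = lx * TECH["long"]["snp"] + sx * TECH["short"]["snp"]
        if sv < sv_target or snp < snp_target:
            continue  # this mix fails at least one reconstruction target
        cost = lx * TECH["long"]["cost"] + sx * TECH["short"]["cost"]
        if best is None or cost < best[0]:
            best = (cost, lx, sx)
    return best  # (total cost, long-read coverage, short-read coverage)

cost, lx, sx = cheapest_mix()
print(f"cheapest mix: {lx}x long + {sx}x short at cost {cost:.0f}")
```

Under these toy parameters long reads are the cost-efficient route to the SV target while short reads are cheaper per unit of SNP accuracy, so the optimum combines both and beats either pure strategy, echoing the paper's finding that mixed designs are more cost-effective than single-technology ones.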
Targeted sequencing is a cost-efficient way to obtain answers to biological questions in many projects, but the choice of the enrichment method to use can be difficult. In this study we compared two hybridization methods for target enrichment for massively parallel sequencing and single nucleotide polymorphism (SNP) discovery, namely Nimblegen sequence capture arrays and the SureSelect liquid-based hybrid capture system. We prepared sequencing libraries from three HapMap samples using both methods, sequenced the libraries on the Illumina Genome Analyzer, mapped the sequencing reads back to the genome, and called variants in the sequences. 74–75% of the sequence reads originated from the targeted region in the SureSelect libraries and 41–67% in the Nimblegen libraries. We could sequence up to 99.9% and 99.5% of the regions targeted by capture probes from the SureSelect libraries and from the Nimblegen libraries, respectively. The Nimblegen probes covered 0.6 Mb more of the original 3.1 Mb target region than the SureSelect probes. In each sample, we called more SNPs and detected more novel SNPs from the libraries that were prepared using the Nimblegen method. Thus the Nimblegen method gave better results when judged by the number of SNPs called, but this came at the cost of more over-sampling.
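The "more SNPs at the cost of more over-sampling" trade-off is, at bottom, coverage arithmetic: for a fixed read budget, mean target depth scales linearly with the on-target fraction and inversely with target size. A back-of-envelope sketch, with read counts assumed for illustration rather than reported by the study:

```python
# Coverage arithmetic behind the over-sampling trade-off. The 20M read budget
# is an assumed figure; only the ~3.1 Mb target size comes from the text.
def mean_target_depth(n_reads: int, read_len: int, on_target: float, target_bp: int) -> float:
    """Mean fold-coverage of the target, assuming on-target bases spread uniformly."""
    return n_reads * read_len * on_target / target_bp

# The same 20 million 76 bp reads, differing only in on-target rate:
print(round(mean_target_depth(20_000_000, 76, 0.75, 3_100_000), 1))  # SureSelect-like rate
print(round(mean_target_depth(20_000_000, 76, 0.50, 3_100_000), 1))  # lower-specificity rate
```

A method with lower capture specificity must spend more raw sequence (over-sampling) to reach the same usable depth, which is the cost the comparison above quantifies.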
High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms, when low coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome-wide association studies and supports development of improved methods for genotype calling.
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low coverage sequencing compare to traditional microarray-based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes based on analysis of SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the HapMap and 1000 Genomes (1000G) Projects, we show that the 1M SNP arrays used for genome-wide association studies perform similarly to 1× sequencing. We find that adding low coverage sequence reads to dense array data significantly increases rare variant sensitivity, but adding dense array data to low coverage sequencing has only a small impact. Finally, we describe an improved SNP calling algorithm used in the 1000G Project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improve genotype calling methods.
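The joint-calling idea can be sketched as a naive Bayes combination of per-genotype evidence, assuming the data sources are conditionally independent given the genotype. The binomial read model, the made-up array "likelihoods" (standing in for a cluster model), and the prior values below are illustrative assumptions, not the authors' algorithm.

```python
from math import comb

GENOTYPES = (0, 1, 2)  # copies of the alternate allele

def read_likelihood(alt: int, depth: int, g: int, err: float = 0.01) -> float:
    """Binomial likelihood of `alt` alt-reads out of `depth` reads given genotype g."""
    p_alt = {0: err, 1: 0.5, 2: 1 - err}[g]
    return comb(depth, alt) * p_alt**alt * (1 - p_alt) ** (depth - alt)

def joint_posterior(alt: int, depth: int, array_lik: dict, prior: dict) -> dict:
    """Combine read, array, and imputation evidence; array_lik/prior are per-genotype."""
    unnorm = {g: read_likelihood(alt, depth, g) * array_lik[g] * prior[g]
              for g in GENOTYPES}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

# 1x-style data: a single alt read is ambiguous on its own, but the array
# evidence and the imputation prior pull the call toward the heterozygote.
post = joint_posterior(alt=1, depth=1,
                       array_lik={0: 0.05, 1: 0.9, 2: 0.05},
                       prior={0: 0.49, 1: 0.42, 2: 0.09})
print(max(post, key=post.get))  # the combined call is the heterozygote (genotype 1)
```

The same machinery explains the asymmetry reported above: a single read carries little likelihood weight next to dense array intensities, whereas adding reads to an uninformative site can still shift an otherwise flat posterior.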
In the event of biocrimes or infectious disease outbreaks, high-resolution genetic characterization for identifying the agent and attributing it to a specific source can be crucial for an effective response. Until recently, in-depth genetic characterization required expensive and time-consuming Sanger sequencing of a few strains, followed by genotyping of a small number of marker loci in a panel of isolates, or by gel-based approaches such as pulsed field gel electrophoresis, which by necessity ignore most of the genome. Next-generation, massively parallel sequencing (MPS) technology (specifically the Applied Biosystems sequencing by oligonucleotide ligation and detection (SOLiD™) system) is a powerful investigative tool for rapid, cost-effective and parallel microbial whole-genome characterization.
To demonstrate the utility of MPS for whole-genome typing of monomorphic pathogens, four Bacillus anthracis and four Yersinia pestis strains were sequenced in parallel. Reads were aligned to complete reference genomes, and genomic variations were identified. Resequencing of the B. anthracis Ames ancestor strain detected no false-positive single-nucleotide polymorphisms (SNPs), and mapping of reads to the Sterne strain correctly identified 98% of the 133 SNPs that are not clustered or associated with repeats. Three geographically distinct B. anthracis strains from the A branch lineage were found to have between 352 and 471 SNPs each, relative to the Ames genome, and one strain harbored a genomic amplification. Sequencing of four Y. pestis strains from the Orientalis lineage identified between 20 and 54 SNPs per strain relative to the CO92 genome, with the single Bolivian isolate having approximately twice as many SNPs as the three more closely related North American strains. Coverage plotting also revealed a common deletion in two strains and an amplification in the Bolivian strain that appear to be due to insertion element-mediated recombination events. Most private SNPs (that is, a variant found in only one strain in this set) selected for validation by Sanger sequencing were confirmed, although rare false-positive SNPs were associated with variable-number tandem repeats.
The high-throughput, multiplexing capability, and accuracy of this system make it suitable for rapid whole-genome typing of microbial pathogens during a forensic or epidemiological investigation. By interrogating nearly every base of the genome, rare polymorphisms can be reliably discovered, thus facilitating high-resolution strain tracking and strengthening forensic attribution.
Recent large-scale genomic studies within human populations have identified numerous genomic regions as copy number variant (CNV). As these CNV regions often overlap coding regions of the genome, large lists of potentially copy number polymorphic genes have been produced that are candidates for disease association. Most of the current data regarding normal genic variation, however, has been generated using BAC or SNP microarrays, which lack precision especially with respect to exons. To address this, we assessed 2,790 candidate CNV genes defined from available studies in nine well-characterized HapMap individuals by designing a customized oligonucleotide microarray targeted specifically to exons. Using exon array comparative genomic hybridization (aCGH), we detected 255 (9%) of the candidates as true CNVs including 134 with evidence of variation over the entire gene. Individuals differed in copy number from the control by an average of 100 gene loci. Both partial- and whole-gene CNVs were strongly associated with segmental duplications (55 and 71%, respectively) as well as regions of positive selection. We confirmed 37% of the whole-gene CNVs using the fosmid end sequence pair (ESP) structural variation map for these same individuals. If we modify the end sequence pair mapping strategy to include low-sequence identity ESPs (98–99.5%) and ESPs with an everted orientation, we can capture 82% of the missed genes leading to more complete ascertainment of structural variation within duplicated genes. Our results indicate that segmental duplications are the source of the majority of full-length copy number polymorphic genes, most of the variant genes are organized as tandem duplications, and a significant fraction of these genes will represent paralogs with levels of sequence diversity beyond thresholds of allelic variation. 
In addition, these data provide a targeted set of CNV genes enriched for regions likely to be associated with human phenotypic differences due to copy number changes and present a source of copy number responsive oligonucleotide probes for future association studies.