Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions).
We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per Hiseq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach.
We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.
Two-stage mapping; Read-backmapping; Software performance; SNP discovery; Multiplexed targeted next-generation sequencing
DNA barcodes are short unique sequences used to label DNA or RNA-derived samples in multiplexed deep sequencing experiments. During the demultiplexing step, barcodes must be detected and their position identified. In some cases (e.g., with PacBio SMRT), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives.
For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements.
In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. This problem became more acute when the length of the barcode sequence decreased and the number of barcodes in the set increased. The method presented in this paper controls the tail area-based false discovery rate to distinguish between barcoded and unbarcoded reads. This method helps to establish the highest acceptable minimal distance between reads and barcode sequences. In a proof of concept experiment we correctly detected barcodes in 83% of the reads with a precision of 89%. Sensitivity improved to 99% at 99% precision when the adjacent primer sequence was incorporated in the analysis. The analysis was further improved using a paired end strategy. Following an analysis of the data for sequence variants induced in the Atp1a1 gene of C57BL/6 murine melanocytes by ultraviolet light and conferring resistance to ouabain, we found no evidence of cross-contamination of DNA material between samples.
Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. It is based on the false discovery rate statistics that allows a proper trade-off between sensitivity and precision to be chosen.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-264) contains supplementary material, which is available to authorized users.
Detection of the rare polymorphisms and causative mutations of genetic diseases in a targeted genomic area has become a major goal in order to understand genomic and phenotypic variability. We have interrogated repeat-masked regions of 8.9 Mb on human chromosomes 21 (7.8 Mb) and 7 (1.1 Mb) from an individual from the International HapMap Project (NA12872). We have optimized a method of genomic selection for high throughput sequencing. Microarray-based selection and sequencing resulted in 260-fold enrichment, with 41% of reads mapping to the target region. 83% of SNPs in the targeted region had at least 4-fold sequence coverage and 54% at least 15-fold. When assaying HapMap SNPs in NA12872, our sequence genotypes are 91.3% concordant in regions with coverage≥4-fold, and 97.9% concordant in regions with coverage≥15-fold. About 81% of the SNPs recovered with both thresholds are listed in dbSNP. We observed that regions with low sequence coverage occur in close proximity to low-complexity DNA. Validation experiments using Sanger sequencing were performed for 46 SNPs with 15-20 fold coverage, with a confirmation rate of 96%, suggesting that DNA selection provides an accurate and cost-effective method for identifying rare genomic variants.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
Targeted genomic enrichment (TGE) is a widely used method for isolating and enriching specific genomic regions prior to massively parallel sequencing. To make effective use of sequencer output, barcoding and sample pooling (multiplexing) after TGE and prior to sequencing (post-capture multiplexing) has become routine. While previous reports have indicated that multiplexing prior to capture (pre-capture multiplexing) is feasible, no thorough examination of the effect of this method has been completed on a large number of samples. Here we compare standard post-capture TGE to two levels of pre-capture multiplexing: 12 or 16 samples per pool. We evaluated these methods using standard TGE metrics and determined the ability to identify several classes of genetic mutations in three sets of 96 samples, including 48 controls. Our overall goal was to maximize cost reduction and minimize experimental time while maintaining a high percentage of reads on target and a high depth of coverage at thresholds required for variant detection.
We adapted the standard post-capture TGE method for pre-capture TGE with several protocol modifications, including redesign of blocking oligonucleotides and optimization of enzymatic and amplification steps. Pre-capture multiplexing reduced costs for TGE by at least 38% and significantly reduced hands-on time during the TGE protocol. We found that pre-capture multiplexing reduced capture efficiency by 23 or 31% for pre-capture pools of 12 and 16, respectively. However efficiency losses at this step can be compensated by reducing the number of simultaneously sequenced samples. Pre-capture multiplexing and post-capture TGE performed similarly with respect to variant detection of positive control mutations. In addition, we detected no instances of sample switching due to aberrant barcode identification.
Pre-capture multiplexing improves efficiency of TGE experiments with respect to hands-on time and reagent use compared to standard post-capture TGE. A decrease in capture efficiency is observed when using pre-capture multiplexing; however, it does not negatively impact variant detection and can be accommodated by the experimental design.
Massively parallel sequencing; Next-generation sequencing; Genomics; Targeted genomic enrichment; Sequence capture; Pre-capture multiplexing; Post-capture multiplexing; Indexing
In the event of biocrimes or infectious disease outbreaks, high-resolution genetic characterization for identifying the agent and attributing it to a specific source can be crucial for an effective response. Until recently, in-depth genetic characterization required expensive and time-consuming Sanger sequencing of a few strains, followed by genotyping of a small number of marker loci in a panel of isolates at or by gel-based approaches such as pulsed field gel electrophoresis, which by necessity ignores most of the genome. Next-generation, massively parallel sequencing (MPS) technology (specifically the Applied Biosystems sequencing by oligonucleotide ligation and detection (SOLiD™) system) is a powerful investigative tool for rapid, cost-effective and parallel microbial whole-genome characterization.
To demonstrate the utility of MPS for whole-genome typing of monomorphic pathogens, four Bacillus anthracis and four Yersinia pestis strains were sequenced in parallel. Reads were aligned to complete reference genomes, and genomic variations were identified. Resequencing of the B. anthracis Ames ancestor strain detected no false-positive single-nucleotide polymorphisms (SNPs), and mapping of reads to the Sterne strain correctly identified 98% of the 133 SNPs that are not clustered or associated with repeats. Three geographically distinct B. anthracis strains from the A branch lineage were found to have between 352 and 471 SNPs each, relative to the Ames genome, and one strain harbored a genomic amplification. Sequencing of four Y. pestis strains from the Orientalis lineage identified between 20 and 54 SNPs per strain relative to the CO92 genome, with the single Bolivian isolate having approximately twice as many SNPs as the three more closely related North American strains. Coverage plotting also revealed a common deletion in two strains and an amplification in the Bolivian strain that appear to be due to insertion element-mediated recombination events. Most private SNPs (that is, a, variant found in only one strain in this set) selected for validation by Sanger sequencing were confirmed, although rare false-positive SNPs were associated with variable nucleotide tandem repeats.
The high-throughput, multiplexing capability, and accuracy of this system make it suitable for rapid whole-genome typing of microbial pathogens during a forensic or epidemiological investigation. By interrogating nearly every base of the genome, rare polymorphisms can be reliably discovered, thus facilitating high-resolution strain tracking and strengthening forensic attribution.
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing.
Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP.
Availability: Implementation of this method is available at http://polymorphism.scripps.edu/∼vbansal/software/CRISP/
Next-generation DNA sequencing is opening new avenues for genetic association studies in common diseases that, like deep vein thrombosis (DVT), have a strong genetic predisposition still largely unexplained by currently identified risk variants. In order to develop sequencing and analytical pipelines for the application of next-generation sequencing to complex diseases, we conducted a pilot study sequencing the coding area of 186 hemostatic/proinflammatory genes in 10 Italian cases of idiopathic DVT and 12 healthy controls.
A molecular-barcoding strategy was used to multiplex DNA target capture and sequencing, while retaining individual sequence information. Genomic libraries with barcode sequence-tags were pooled (in pools of 8 or 16 samples) and enriched for target DNA sequences. Sequencing was performed on ABI SOLiD-4 platforms. We produced > 12 gigabases of raw sequence data to sequence at high coverage (average: 42X) the 700-kilobase target area in 22 individuals. A total of 1876 high-quality genetic variants were identified (1778 single nucleotide substitutions and 98 insertions/deletions). Annotation on databases of genetic variation and human disease mutations revealed several novel, potentially deleterious mutations. We tested 576 common variants in a case-control association analysis, carrying the top-5 associations over to replication in up to 719 DVT cases and 719 controls. We also conducted an analysis of the burden of nonsynonymous variants in coagulation factor and anticoagulant genes. We found an excess of rare missense mutations in anticoagulant genes in DVT cases compared to controls and an association for a missense polymorphism of FGA (rs6050; p = 1.9 × 10-5, OR 1.45; 95% CI, 1.22-1.72; after replication in > 1400 individuals).
We implemented a barcode-based strategy to efficiently multiplex sequencing of hundreds of candidate genes in several individuals. In the relatively small dataset of our pilot study we were able to identify bona fide associations with DVT. Our study illustrates the potential of next-generation sequencing for the discovery of genetic variation predisposing to complex diseases.
Deep vein thrombosis; venous thromboembolism; next-generation sequencing; target capture; multiplexing; FGA; rs6025; heamostateome; DVT; VTE
The recent advancement in human genome sequencing and genotyping has revealed millions of single nucleotide polymorphisms (SNP) which determine the variation among human beings. One of the particular important projects is The International HapMap Project which provides the catalogue of human genetic variation for disease association studies. In this paper, we analyzed the genotype data in HapMap project by using National Institute of Environmental Health Sciences Environmental Genome Project (NIEHS EGP) SNPs. We first determine whether the HapMap data are transferable to the NIEHS data. Then, we study how well the HapMap SNPs capture the untyped SNPs in the region. Finally, we provide general guidelines for determining whether the SNPs chosen from HapMap may be able to capture most of the untyped SNPs.
Our analysis shows that HapMap data are not robust enough to capture the untyped variants for most of the human genes. The performance of SNPs for European and Asian samples are marginal in capturing the untyped variants, i.e. approximately 55%. Expectedly, the SNPs from HapMap YRI panel can only capture approximately 30% of the variants. Although the overall performance is low, however, the SNPs for some genes perform very well and are able to capture most of the variants along the gene. This is observed in the European and Asian panel, but not in African panel. Through observation, we concluded that in order to have a well covered SNPs reference panel, the SNPs density and the association among reference SNPs are important to estimate the robustness of the chosen SNPs.
We have analyzed the coverage of HapMap SNPs using NIEHS EGP data. The results show that HapMap SNPs are transferable to the NIEHS SNPs. However, HapMap SNPs cannot capture some of the untyped SNPs and therefore resequencing may be needed to uncover more SNPs in the missing region.
Mitochondrial disorders can originate from mutations in one of many nuclear genes controlling the organelle function or in the mitochondrial genome (mitochondrial DNA (mtDNA)). The large numbers of potential culprit genes, together with the little guidance offered by most clinical phenotypes as to which gene may be causative, are a great challenge for the molecular diagnosis of these disorders.
We developed a novel targeted resequencing assay for mitochondrial disorders relying on microarray-based hybrid capture coupled to next-generation sequencing. Specifically, we subjected the entire mtDNA genome and the exons and intron-exon boundary regions of 362 known or candidate causative nuclear genes to targeted capture and resequencing. We here provide proof-of-concept data by testing one HapMap DNA sample and two positive control samples.
Over 94% of the targeted regions were captured and sequenced with appropriate coverage and quality, allowing reliable variant calling. Pathogenic mutations blindly tested in patients' samples were 100% concordant with previous Sanger sequencing results: a known mutation in Pyruvate dehydrogenase alpha 1 subunit (PDHA1), a novel splicing and a known coding mutation in Hydroxyacyl-CoA dehydrogenase alpha subunit (HADHA) were correctly identified. Of the additional variants recognized, 90 to 94% were present in dbSNP while 6 to 10% represented new alterations. The novel nonsynonymous variants were all in heterozygote state and mostly predicted to be benign. The depth of sequencing coverage of mtDNA was extremely high, suggesting that it may be feasible to detect pathogenic mtDNA mutations confounded by low level heteroplasmy. Only one sequencing lane of an eight lane flow cell was utilized for each sample, indicating that a cost-effective clinical test can be achieved.
Our study indicates that the use of next generation sequencing technology holds great promise as a tool for screening mitochondrial disorders. The availability of a comprehensive molecular diagnostic tool will increase the capacity for early and rapid identification of mitochondrial disorders. In addition, the proposed approach has the potential to identify new mutations in candidate genes, expanding and redefining the spectrum of causative genes responsible for mitochondrial disorders.
Many disease-associated variants identified by genome-wide association (GWA) studies are expected to regulate gene expression. Allele-specific expression (ASE) quantifies transcription from both haplotypes using individuals heterozygous at tested SNPs. We performed deep human transcriptome-wide resequencing (RNA-seq) for ASE analysis and expression quantitative trait locus discovery. We resequenced double poly(A)-selected RNA from primary CD4+ T cells (n = 4 individuals, both activated and untreated conditions) and developed tools for paired-end RNA-seq alignment and ASE analysis. We generated an average of 20 million uniquely mapping 45 base reads per sample. We obtained sufficient read depth to test 1371 unique transcripts for ASE. Multiple biases inflate the false discovery rate which we estimate to be ∼50% for random SNPs. However, after controlling for these biases and considering the subset of SNPs that pass HapMap QC, 4.6% of heterozygous SNP-sample pairs show evidence of imbalance (P < 0.001). We validated four findings by both bacterial cloning and Sanger sequencing assays. We also found convincing evidence for allelic imbalance at multiple reporter exonic SNPs in CD6 for two samples heterozygous at the multiple sclerosis-associated variant rs17824933, linking GWA findings with variation in gene expression. Finally, we show in CD4+ T cells from a further individual that high-throughput sequencing of genomic DNA and RNA-seq following enrichment for targeted gene sequences by sequence capture methods offers an unbiased means to increase the read depth for transcripts of interest, and therefore a method to investigate the regulatory role of many disease-associated genetic variants.
Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing of many individuals across many genomic regions and comparing against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost prohibitive for most investigators. In addition, DNA samples are often precious and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost and time-efficient strategy for surveying populations of individuals for rare variants. We set out to 1) create a scalable, multiplexing method for custom capture with or without individual DNA indexing that was amenable to low amounts of input DNA and 2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner, Novoalign.
We report methodology for pooled hybridization capture with pre-enrichment, indexed multiplexing of up to 48 individuals or non-indexed pooled sequencing of up to 92 individuals with as little as 70 ng of DNA per person. Modified solid phase reversible immobilization bead purification strategies enable no sample transfers from sonication in 96-well plates through adapter ligation, resulting in 50% less library preparation reagent consumption. Custom Y-shaped adapters containing novel 7 base pair index sequences with a Hamming distance of ≥2 were directly ligated onto fragmented source DNA eliminating the need for PCR to incorporate indexes, and was followed by a custom blocking strategy using a single oligonucleotide regardless of index sequence. These results were obtained aligning raw reads against the entire genome using Novoalign followed by variant calling of non-indexed pools using SPLINTER or SAMtools for indexed samples. With these pipelines, we find sensitivity and specificity of 99.4% and 99.7% for pooled exome sequencing. Sensitivity, and to a lesser degree specificity, proved to be a function of coverage. For rare variants (≤2% minor allele frequency), we achieved sensitivity and specificity of ≥94.9% and ≥99.99% for custom capture of 2.5 Mb in multiplexed libraries of 22–48 individuals with only ≥5-fold coverage/chromosome, but these parameters improved to ≥98.7 and 100% with 20-fold coverage/chromosome.
This highly scalable methodology enables accurate rare variant detection, with or without individual DNA sample indexing, while reducing the amount of required source DNA and total costs through less hybridization reagent consumption, multi-sample sonication in a standard PCR plate, multiplexed pre-enrichment pooling with a single hybridization and lesser sequencing coverage required to obtain high sensitivity.
Rare variants; Genomics; Exome; Hybridization capture; Multiplexed capture; Indexed capture; SPLINTER
Massively parallel sequencing of barcoded DNA samples significantly increases screening efficiency for clinically important genes. Short read aligners are well suited to single nucleotide and indel detection. However, methods for CNV detection from targeted enrichment are lacking. We present a method combining coverage with map information for the identification of deletions and duplications in targeted sequence data.
Sequencing data is first scanned for gains and losses using a comparison of normalized coverage data between samples. CNV calls are confirmed by testing for a signature of sequences that span the CNV breakpoint. With our method, CNVs can be identified regardless of whether breakpoints are within regions targeted for sequencing. For CNVs where at least one breakpoint is within targeted sequence, exact CNV breakpoints can be identified. In a test data set of 96 subjects sequenced across ~1 Mb genomic sequence using multiplexing technology, our method detected mutations as small as 31 bp, predicted quantitative copy count, and had a low false-positive rate.
Application of this method allows for identification of gains and losses in targeted sequence data, providing comprehensive mutation screening when combined with a short read aligner.
Rapid, accurate, and inexpensive sequencing of exomes is critical to understand DNA variation in human disease. Ion Torrent has developed a benchtop research semiconductor sequencer, the Ion Proton™, that uses a novel CMOS chip with 165 million 1.3mm-diameter microwells, automatically templated sub-micron particles, and integrated hardware and software that enables acquisition of ∼5 billion data points per second over a 2-4 hour runtime with on-instrument signal processing.
To illustrate the speed, accuracy, and ease-of-use of the Proton system, analysis of a HapMap familial trio of exomes will be presented. Exome libraries are obtained with high-specificity hybridization probes targeting ∼50 Mb of human exons that span 21,700 annotated protein-coding genes, microRNA, key non-coding RNA genes, and 44,000 predicted microRNA binding sites. Exome reads map on-target 75-83% between runs and 10.6 Gb of aligned data, obtained from a single P1 chip, yielded 141X average depth with 30X coverage of 90% of targeted bases. Read mapping, coverage analysis, variant calling and annotation are done with Torrent Suite and Ion Reporter™ software. Each trio dataset yielded ∼30,000 SNP calls from single runs that exceeded 9 Gb of aligned data. The observed Het:Hom ratio of 1.4-1.5 matches the published range of 1.25-1.7 for European ethnicity and the observed Ts:Tv ratio of 2.9 agrees well with the published range of 2.8-3.1 for human exomes. The SNP concordance with dbSNP137 is greater than 98% and Het and Hom concordances with Complete Genomics data are 98% and 96%, respectively. Mendelian inheritance analysis indicates that error for Hets is 0.6% with no errors for homozygotic SNPs. The Proton system delivers high-quality individual exome datasets rapidly and can be used for trio analysis to detect shared germline SNPs with high confidence.
The Ion Proton™ System is for research use only and not for use in diagnostic procedures.
Single nucleotide polymorphisms (SNPs) in the KLK3 gene on chromosome 19q13.33 are associated with serum prostate-specific antigen (PSA) levels. Recent genome wide association studies of prostate cancer have yielded conflicting results for association of the same SNPs with prostate cancer risk. Since the KLK3 gene encodes the PSA protein that forms the basis for a widely used screening test for prostate cancer, it is critical to fully characterize genetic variation in this region and assess its relationship with the risk of prostate cancer. We have conducted a next-generation sequence analysis in 78 individuals of European ancestry to characterize common (minor allele frequency, MAF >1%) genetic variation in a 56 kb region on chromosome 19q13.33 centered on the KLK3 gene (chr19:56,019,829–56,076,043 bps). We identified 555 polymorphic loci in the process including 116 novel SNPs and 182 novel insertion/deletion polymorphisms (indels). Based on tagging analysis, 144 loci are necessary to tag the region at an r2 threshold of 0.8 and MAF of 1% or higher, while 86 loci are required to tag the region at an r2 threshold of 0.8 and MAF >5%. Our sequence data augments coverage by 35 and 78% as compared to variants in dbSNP and HapMap, respectively. We observed six non-synonymous amino acid or frame shift changes in the KLK3 gene and three changes in each of the neighboring genes, KLK15 and KLK2. Our study has generated a detailed map of common genetic variation in the genomic region surrounding the KLK3 gene, which should be useful for fine-mapping the association signal as well as determining the contribution of this locus to prostate cancer risk and/or regulation of PSA expression.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-009-0751-5) contains supplementary material, which is available to authorized users.
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. Massively parallel sequencing technologies have enabled scientists to discover rare mutations, structural variants, and novel transcripts at an unprecedented rate. To meet the demand for fast, inexpensive and accurate genome analysis method, Agilent Technologies has developed the SureSelect platform, an in-solution hybrid selection technology for systematic re-sequencing of user specific genomic regions. With the implementation of this new technology there is a balancing act of cost, quality and quantity and it is easier for scientists to sequence entire genomes from large sample cohorts. The inexpensive production of large volumes of user specific sequence data is SureSelect's primary advantage over conventional methods. To further reduce costs and take advantage of the increasing capacity of next-generation sequencers, such as the HiSeq2000 and the SOLiD4/4hq, we highlight the ability to multiplex DNA samples in a single sequencing lane/slide while maintaining the coverage necessary to confidently make SNP calls. SureSlelect multiplexing kits have an automation-friendly, easy to use protocol where gDNA libraries are uniquely “tagged” and then combined via mass balance on one flow cell lane/slide. We show high performance across both Illumina and SOLiD multiplexing platforms, as measured by capture efficiency, uniformity and reproducibility. The multiplexing capabilities SureSelect make it a cost effective way to study human and mouse exome, or any user defined region of interest. When multiplexing HapMap samples, >98% concordance between SureSelect re-sequencing results and previously determined genotype is observed. Lastly, we introduce the SureSelect XT kit for preparation of samples for multiplex sequencing using the Illumina GAII or HiSeq. The SureSelect Multiplexing kit provides the ability to combine targeted enrichment with multiplexing, thus maximizing the number of samples that can be sequenced at one time, providing optimum time and cost savings without sacrificing performance.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
Comparison of polymorphism at synonymous and non-synonymous sites in protein-coding DNA can provide evidence for selective constraint. Non-coding DNA that forms part of the regulatory landscape presents more of a challenge since there is not such a clear-cut distinction between sites under stronger and weaker selective constraint. Here, we consider putative regulatory elements termed Conserved Non-coding Elements (CNEs) defined by their high level of sequence identity across all vertebrates. Some mutations in these regions have been implicated in developmental disorders; we analyse CNE polymorphism data to investigate whether such deleterious effects are widespread in humans. Single nucleotide variants from the HapMap and 1000 Genomes Projects were mapped across nearly 2000 CNEs. In the 1000 Genomes data we find a significant excess of rare derived alleles in CNEs relative to coding sequences; this pattern is absent in HapMap data, apparently obscured by ascertainment bias. The distribution of polymorphism within CNEs is not uniform; we could identify two categories of sites by exploiting deep vertebrate alignments: stretches that are non-variant, and those that have at least one substitution. The conserved category has fewer polymorphic sites and a greater excess of rare derived alleles, which can be explained by a large proportion of sites under strong purifying selection within humans – higher than that for non-synonymous sites in most protein coding regions, and comparable to that at the strongly conserved trans-dev genes. Conversely, the more evolutionarily labile CNE sites have an allele frequency distribution not significantly different from non-synonymous sites. Future studies should exploit genome-wide re-sequencing to obtain better coverage in selected non-coding regions, given the likelihood that mutations in evolutionarily conserved enhancer sequences are deleterious. Discovery pipelines should validate non-coding variants to aid in identifying causal and risk-enhancing variants in complex disorders, in contrast to the current focus on exome sequencing.
We describe methods for rapid sequencing of the entire human mitochondrial genome (mtgenome), which involve long-range PCR for specific amplification of the mtgenome, pyrosequencing, quantitative mapping of sequence reads to identify sequence variants and heteroplasmy, as well as de novo sequence assembly. These methods have been used to study 40 publicly available HapMap samples of European (CEU) and African (YRI) ancestry to demonstrate a sequencing error rate <5.63×10−4, nucleotide diversity of 1.6×10−3 for CEU and 3.7×10−3 for YRI, patterns of sequence variation consistent with earlier studies, but a higher rate of heteroplasmy varying between 10% and 50%. These results demonstrate that next-generation sequencing technologies allow interrogation of the mitochondrial genome in greater depth than previously possible which may be of value in biology and medicine.
This manuscript details a novel algorithm to evaluate high-throughput DNA sequence data from whole mitochondrial genomes purified from genomic DNA, which also contains multiple fragmented nuclear copies of mtgenomes (numts). 40 samples were selected from 2 distinct reference (HapMap) populations of African (YRI) and European (CEU) origin. While previous technologies did not allow the assessment of individual mitochondrial molecules, next-generation sequencing technology is an excellent tool for obtaining the mtgenome sequence and its heteroplasmic sites rapidly and accurately through deep coverage of the genome. The computational techniques presented optimize reference-based alignments and introduce a new de novo assembly method. An important contribution of our study was obtaining high accuracy of the resulting called bases that we accomplished by quantitative filtering of reads that were error prone. In addition, several sites were experimentally validated and our method has a strong correlation (R2 = 0.96) with the NIST standard reference sample for heteroplasmy. Overall, our findings indicate that one can now confidently genotype mtDNA variants using next-generation sequencing data and reveal low levels of heteroplasmy (>10%). Beyond enriching our understanding and pathology of certain diseases, this development could be considered as a prelude to sequence-based individualized medicine for the mtgenome.
Identification of genes responsible for medically important traits is a major challenge in human genetics. Due to the genetic heterogeneity of hearing loss, targeted DNA capture and massively parallel sequencing are ideal tools to address this challenge. Our subjects for genome analysis are Israeli Jewish and Palestinian Arab families with hearing loss that varies in mode of inheritance and severity.
A custom 1.46 MB design of cRNA oligonucleotides was constructed containing 246 genes responsible for either human or mouse deafness. Paired-end libraries were prepared from 11 probands and bar-coded multiplexed samples were sequenced to high depth of coverage. Rare single base pair and indel variants were identified by filtering sequence reads against polymorphisms in dbSNP132 and the 1000 Genomes Project. We identified deleterious mutations in CDH23, MYO15A, TECTA, TMC1, and WFS1. Critical mutations of the probands co-segregated with hearing loss. Screening of additional families in a relevant population was performed. TMC1 p.S647P proved to be a founder allele, contributing to 34% of genetic hearing loss in the Moroccan Jewish population.
Critical mutations were identified in 6 of the 11 original probands and their families, leading to the identification of causative alleles in 20 additional probands and their families. The integration of genomic analysis into early clinical diagnosis of hearing loss will enable prediction of related phenotypes and enhance rehabilitation. Characterization of the proteins encoded by these genes will enable an understanding of the biological mechanisms involved in hearing loss.
Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs.
We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high quality data that was indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric targets correspond to highly expressed genes in rich media. The presence of Cse4 non-centromeric binding sites was not reported previously.
We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to further decrease costs per sample and to accelerate the completion of large consortium projects such as modENCODE.
Advances in high-throughput genotyping and the International HapMap Project have enabled association studies at the whole-genome level. We have constructed whole-genome genotyping panels of over 550,000 (HumanHap550) and 650,000 (HumanHap650Y) SNP loci by choosing tag SNPs from all populations genotyped by the International HapMap Project. These panels also contain additional SNP content in regions that have historically been overrepresented in diseases, such as nonsynonymous sites, the MHC region, copy number variant regions and mitochondrial DNA. We estimate that the tag SNP loci in these panels cover the majority of all common variation in the genome as measured by coverage of both all common HapMap SNPs and an independent set of SNPs derived from complete resequencing of genes obtained from SeattleSNPs. We also estimate that, given a sample size of 1,000 cases and 1,000 controls, these panels have the power to detect single disease loci of moderate risk (λ ∼ 1.8–2.0). Relative risks as low as λ ∼ 1.1–1.3 can be detected using 10,000 cases and 10,000 controls depending on the sample population and disease model. If multiple loci are involved, the power increases significantly to detect at least one locus such that relative risks 20%–35% lower can be detected with 80% power if between two and four independent loci are involved. Although our SNP selection was based on HapMap data, which is a subset of all common SNPs, these panels effectively capture the majority of all common variation and provide high power to detect risk alleles that are not represented in the HapMap data.
Advances in high-throughput genotyping technology and the International HapMap Project have enabled genetic association studies at the whole-genome level. Our paper describes two genome-wide SNP panels that contain tag SNPs derived from the International HapMap Project. Tag SNPs are proxies for groups of highly correlated SNPs. Information can be captured for the entire group of correlated SNPs by genotyping only one representative SNP, the tag SNP. These whole-genome SNP panels also contain additional content thought to be overrepresented in disease, such as amino acid–changing nonsynonymous SNPs and mitochondrial SNPs. We show that these panels cover the genome with very high efficiency as measured by coverage of all HapMap SNPs and a set of SNPs derived from completely resequenced genes from the Seattle SNPs database. We also show that these panels have high power to detect disease risk alleles for both HapMap and non-HapMap SNPs. In complex disease where multiple risk alleles are believed to be involved, we show that the ability to detect at least one risk allele with the tag SNP panels is also high.
Many hypothesis-driven genetic studies require the ability to comprehensively and efficiently target specific regions of the genome to detect sequence variations. Often, sample availability is limited requiring the use of whole genome amplification (WGA). We evaluated a high-throughput microdroplet-based PCR approach in combination with next generation sequencing (NGS) to target 384 discrete exons from 373 genes involved in cancer. In our evaluation, we compared the performance of six non-amplified gDNA samples from two HapMap family trios. Three of these samples were also preamplified by WGA and evaluated. We tested sample pooling or multiplexing strategies at different stages of the tested targeted NGS (T-NGS) workflow.
The results demonstrated comparable sequence performance between non-amplified and preamplified samples and between different indexing strategies [sequence specificity of 66.0% ± 3.4%, uniformity (coverage at 0.2× of the mean) of 85.6% ± 0.6%]. The average genotype concordance maintained across all the samples was 99.5% ± 0.4%, regardless of sample type or pooling strategy. We did not detect any errors in the Mendelian patterns of inheritance of genotypes between the parents and offspring within each trio. We also demonstrated the ability to detect minor allele frequencies within the pooled samples that conform to predicted models.
Our described PCR-based sample multiplex approach and the ability to use WGA material for NGS may enable researchers to perform deep resequencing studies and explore variants at very low frequencies and cost.
High-throughput targeted next-generation resequencing; Microdroplet-based multiplex PCR; Sample pooling or multiplexing; Whole-genome amplified DNA samples; Cost reduction
Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common variation across the genome (variants with a minor allele frequency (MAF) of at least 5% in one or more ethnic groups). HapMap along with advances in genotyping technology led to genome-wide association studies which have identified common variants associated with many traits and diseases. In 2008 the 1000 Genomes Project aimed to sequence 2500 individuals and identify rare variants and 99% of variants with a MAF of <1%.
To determine whether the 1000 Genomes Project includes all the variants in HapMap, we examined the overlap between single nucleotide polymorphisms (SNPs) genotyped in the two resources using merged phase II/III HapMap data and low coverage pilot data from 1000 Genomes.
Comparison of the two data sets showed that approximately 72% of HapMap SNPs were also found in 1000 Genomes Project pilot data. After filtering out HapMap variants with a MAF of <5% (separately for each population), 99% of HapMap SNPs were found in 1000 Genomes data.
Not all variants cataloged in HapMap are also cataloged in 1000 Genomes. This could affect decisions about which resource to use for SNP queries, rare variant validation, or imputation. Both the HapMap and 1000 Genomes Project databases are useful resources for human genetics, but it is important to understand the assumptions made and filtering strategies employed by these projects.
We recently described Hi-Plex, a highly multiplexed PCR-based target-enrichment system for massively parallel sequencing (MPS), which allows the uniform definition of library size so that subsequent paired-end sequencing can achieve complete overlap of read pairs. Variant calling from Hi-Plex-derived datasets can thus rely on the identification of variants appearing in both reads of read-pairs, permitting stringent filtering of sequencing chemistry-induced errors. These principles underly ROVER software (derived from Read Overlap PCR-MPS variant caller), which we have recently used to report the screening for genetic mutations in the breast cancer predisposition gene PALB2. Here, we describe the algorithms underlying ROVER and its usage.
ROVER enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. The open-source availability of the software and threshold tailorability enables broad access for a range of PCR-MPS users.
ROVER is implemented in Python and runs on all popular POSIX-like operating systems (Linux, OS X). The software accepts a tab-delimited text file listing the coordinates of the target-specific primers used for targeted enrichment based on a specified genome-build. It also accepts aligned sequence files resulting from mapping to the same genome-build. ROVER identifies the amplicon a given read-pair represents and removes the primer sequences by using the mapping co-ordinates and primer co-ordinates. It considers overlapping read-pairs with respect to primer-intervening sequence. Only when a variant is observed in both reads of a read-pair does the signal contribute to a tally of read-pairs containing or not containing the variant. A user-defined threshold informs the minimum number of, and proportion of, read-pairs a variant must be observed in for a ‘call’ to be made. ROVER also reports the depth of coverage across amplicons to facilitate the identification of any regions that may require further screening.
ROVER can facilitate rapid and accurate genetic variant calling for a broad range of PCR-MPS users.
PCR-MPS; Hi-Plex; Targeted sequencing; Massively parallel sequencing; Variant calling; ROVER variant caller