Related Articles

1.  Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing 
BMC Genomics  2012;13:417.
Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions).
We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per HiSeq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach.
We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results.
PMCID: PMC3563481  PMID: 22913592
Two-stage mapping; Read-backmapping; Software performance; SNP discovery; Multiplexed targeted next-generation sequencing
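The abstract above reports coverage reproducibility as Pearson correlation coefficients between multiplexed samples. As a minimal sketch of that metric (not the authors' pipeline), the function below compares two per-position coverage profiles; the function name `pearson` and the toy depth vectors are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length coverage profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2 positions)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-position read depths for two barcoded samples.
sample_a = [10, 20, 30, 40]
sample_b = [12, 22, 29, 41]
r = pearson(sample_a, sample_b)
```

A value close to 1 indicates that the two samples share the same coverage shape across the target, which is the sense in which the abstract calls the profiles reproducible.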
2.  U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line 
PLoS Genetics  2010;6(1):e1000832.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Author Summary
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
PMCID: PMC2813426  PMID: 20126413
3.  Enhancing the detection of barcoded reads in high throughput DNA sequencing data by controlling the false discovery rate 
BMC Bioinformatics  2014;15(1):264.
DNA barcodes are short unique sequences used to label DNA or RNA-derived samples in multiplexed deep sequencing experiments. During the demultiplexing step, barcodes must be detected and their position identified. In some cases (e.g., with PacBio SMRT), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives.
For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements.
In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. This problem became more acute when the length of the barcode sequence decreased and the number of barcodes in the set increased. The method presented in this paper controls the tail area-based false discovery rate to distinguish between barcoded and unbarcoded reads. This method helps to establish the highest acceptable minimal distance between reads and barcode sequences. In a proof of concept experiment we correctly detected barcodes in 83% of the reads with a precision of 89%. Sensitivity improved to 99% at 99% precision when the adjacent primer sequence was incorporated in the analysis. The analysis was further improved using a paired end strategy. Following an analysis of the data for sequence variants induced in the Atp1a1 gene of C57BL/6 murine melanocytes by ultraviolet light and conferring resistance to ouabain, we found no evidence of cross-contamination of DNA material between samples.
Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. It is based on the false discovery rate statistics that allows a proper trade-off between sensitivity and precision to be chosen.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-264) contains supplementary material, which is available to authorized users.
PMCID: PMC4133078  PMID: 25099007
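The abstract above describes choosing an acceptable distance between reads and barcode sequences. The sketch below is a deliberately simpler stand-in, not the adapted FDR method of the paper: it assigns a read to the nearest barcode by Hamming distance on the read prefix and rejects reads that exceed a fixed threshold or match two barcodes equally well. The names `hamming`, `assign_barcode`, and `max_dist`, and the assumption that the barcode sits at the start of the read, are illustrative only.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_dist=1):
    """Return the name of the unique closest barcode, or None if unassignable.

    barcodes maps a sample name to its barcode sequence. The barcode is
    assumed (for this sketch) to start at the first base of the read.
    """
    best_name, best_dist, tie = None, None, False
    for name, seq in barcodes.items():
        if len(read) < len(seq):
            continue
        dist = hamming(read[:len(seq)], seq)
        if best_dist is None or dist < best_dist:
            best_name, best_dist, tie = name, dist, False
        elif dist == best_dist:
            tie = True
    # Reject reads that are too far from every barcode or ambiguous.
    if best_dist is None or best_dist > max_dist or tie:
        return None
    return best_name


# Hypothetical barcode set and reads.
barcodes = {"sample1": "ACGT", "sample2": "TGCA"}
exact = assign_barcode("ACGTTTTT", barcodes)
one_mismatch = assign_barcode("ACGATTTT", barcodes)
unassigned = assign_barcode("GGGGTTTT", barcodes)
```

Raising `max_dist` recovers more barcoded reads but admits more coincidental matches, which is the sensitivity and precision trade-off that the paper controls statistically rather than with a fixed cutoff.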
4.  Detection of Genomic Variation by Selection of a 9 Mb DNA Region and High Throughput Sequencing 
PLoS ONE  2009;4(8):e6659.
Detection of the rare polymorphisms and causative mutations of genetic diseases in a targeted genomic area has become a major goal in order to understand genomic and phenotypic variability. We have interrogated repeat-masked regions of 8.9 Mb on human chromosomes 21 (7.8 Mb) and 7 (1.1 Mb) from an individual from the International HapMap Project (NA12872). We have optimized a method of genomic selection for high throughput sequencing. Microarray-based selection and sequencing resulted in 260-fold enrichment, with 41% of reads mapping to the target region. 83% of SNPs in the targeted region had at least 4-fold sequence coverage and 54% at least 15-fold. When assaying HapMap SNPs in NA12872, our sequence genotypes are 91.3% concordant in regions with coverage≥4-fold, and 97.9% concordant in regions with coverage≥15-fold. About 81% of the SNPs recovered with both thresholds are listed in dbSNP. We observed that regions with low sequence coverage occur in close proximity to low-complexity DNA. Validation experiments using Sanger sequencing were performed for 46 SNPs with 15-20 fold coverage, with a confirmation rate of 96%, suggesting that DNA selection provides an accurate and cost-effective method for identifying rare genomic variants.
PMCID: PMC2722027  PMID: 19684856
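The abstract above reports genotype concordance separately for sites with at least 4-fold and at least 15-fold coverage. A minimal sketch of that kind of calculation follows; the data structures (dictionaries keyed by SNP identifier), the function name `concordance_at_depth`, and the toy genotypes are assumptions, not the authors' code.

```python
def concordance_at_depth(calls, reference, depth, min_depth):
    """Fraction of sites with matching genotypes among sites covered at least min_depth-fold.

    calls, reference: dicts mapping a site ID to a genotype string.
    depth: dict mapping a site ID to its sequence coverage.
    Returns None when no site passes the coverage filter.
    """
    sites = [
        site for site in calls
        if site in reference and depth.get(site, 0) >= min_depth
    ]
    if not sites:
        return None
    agree = sum(calls[site] == reference[site] for site in sites)
    return agree / len(sites)


# Hypothetical sequence-derived calls, array genotypes, and coverage.
calls = {"rs1": "AG", "rs2": "GG", "rs3": "CT"}
reference = {"rs1": "AG", "rs2": "AG", "rs3": "CT"}
depth = {"rs1": 20, "rs2": 5, "rs3": 16}

low_threshold = concordance_at_depth(calls, reference, depth, min_depth=4)
high_threshold = concordance_at_depth(calls, reference, depth, min_depth=15)
```

In this toy data the discordant call sits at a low-coverage site, so tightening the threshold raises concordance, mirroring the direction of the 91.3% versus 97.9% figures in the abstract.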
5.  A statistical method for the detection of variants from next-generation resequencing of DNA pools 
Bioinformatics  2010;26(12):i318-i324.
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing.
Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP.
Availability: Implementation of this method is available at∼vbansal/software/CRISP/
PMCID: PMC2881398  PMID: 20529923
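The second idea in the abstract above, evaluating the probability of observing multiple non-reference base calls from sequencing errors alone, can be illustrated with a plain binomial tail. This is a simplified sketch under the assumption of a single, uniform per-base error rate; it is not the CRISP implementation, and the names `binom_tail`, `n_reads`, `n_alt`, and `error_rate` are hypothetical.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical pool: 200 reads at a site, 6 non-reference calls,
# assumed per-base error rate of 1%.
n_reads, n_alt, error_rate = 200, 6, 0.01
p_errors_only = binom_tail(n_alt, n_reads, error_rate)
```

A small tail probability suggests that the non-reference calls are unlikely to be errors alone. The abstract notes that CRISP additionally compares allele counts across pools with contingency tables and uses strand and pool-size information, none of which this sketch attempts.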
6.  Enhanced Multiplexing Capabilities of Agilent Technologies SureSelect™ Target Enrichment System for Next-Generation Sequencing 
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. Massively parallel sequencing technologies have enabled scientists to discover rare mutations, structural variants, and novel transcripts at an unprecedented rate. To meet the demand for fast, inexpensive and accurate genome analysis methods, Agilent Technologies has developed the SureSelect platform, an in-solution hybrid selection technology for systematic re-sequencing of user specific genomic regions. With the implementation of this new technology there is a balancing act of cost, quality and quantity and it is easier for scientists to sequence entire genomes from large sample cohorts. The inexpensive production of large volumes of user specific sequence data is SureSelect's primary advantage over conventional methods. To further reduce costs and take advantage of the increasing capacity of next-generation sequencers, such as the HiSeq2000 and the SOLiD4/4hq, we highlight the ability to multiplex DNA samples in a single sequencing lane/slide while maintaining the coverage necessary to confidently make SNP calls. SureSelect multiplexing kits have an automation-friendly, easy to use protocol where gDNA libraries are uniquely “tagged” and then combined via mass balance on one flow cell lane/slide. We show high performance across both Illumina and SOLiD multiplexing platforms, as measured by capture efficiency, uniformity and reproducibility. The multiplexing capabilities of SureSelect make it a cost effective way to study human and mouse exomes, or any user defined region of interest. When multiplexing HapMap samples, >98% concordance between SureSelect re-sequencing results and previously determined genotypes is observed. Lastly, we introduce the SureSelect XT kit for preparation of samples for multiplex sequencing using the Illumina GAII or HiSeq.
The SureSelect Multiplexing kit provides the ability to combine targeted enrichment with multiplexing, thus maximizing the number of samples that can be sequenced at one time, providing optimum time and cost savings without sacrificing performance.
PMCID: PMC3186649
7.  Pre-capture multiplexing improves efficiency and cost-effectiveness of targeted genomic enrichment 
BMC Genomics  2012;13:618.
Targeted genomic enrichment (TGE) is a widely used method for isolating and enriching specific genomic regions prior to massively parallel sequencing. To make effective use of sequencer output, barcoding and sample pooling (multiplexing) after TGE and prior to sequencing (post-capture multiplexing) has become routine. While previous reports have indicated that multiplexing prior to capture (pre-capture multiplexing) is feasible, no thorough examination of the effect of this method has been completed on a large number of samples. Here we compare standard post-capture TGE to two levels of pre-capture multiplexing: 12 or 16 samples per pool. We evaluated these methods using standard TGE metrics and determined the ability to identify several classes of genetic mutations in three sets of 96 samples, including 48 controls. Our overall goal was to maximize cost reduction and minimize experimental time while maintaining a high percentage of reads on target and a high depth of coverage at thresholds required for variant detection.
We adapted the standard post-capture TGE method for pre-capture TGE with several protocol modifications, including redesign of blocking oligonucleotides and optimization of enzymatic and amplification steps. Pre-capture multiplexing reduced costs for TGE by at least 38% and significantly reduced hands-on time during the TGE protocol. We found that pre-capture multiplexing reduced capture efficiency by 23 or 31% for pre-capture pools of 12 and 16, respectively. However efficiency losses at this step can be compensated by reducing the number of simultaneously sequenced samples. Pre-capture multiplexing and post-capture TGE performed similarly with respect to variant detection of positive control mutations. In addition, we detected no instances of sample switching due to aberrant barcode identification.
Pre-capture multiplexing improves efficiency of TGE experiments with respect to hands-on time and reagent use compared to standard post-capture TGE. A decrease in capture efficiency is observed when using pre-capture multiplexing; however, it does not negatively impact variant detection and can be accommodated by the experimental design.
PMCID: PMC3534602  PMID: 23148716
Massively parallel sequencing; Next-generation sequencing; Genomics; Targeted genomic enrichment; Sequence capture; Pre-capture multiplexing; Post-capture multiplexing; Indexing
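The abstract above states that the loss in capture efficiency can be compensated by sequencing fewer samples simultaneously. A back-of-the-envelope sketch of that arithmetic follows. It assumes that the reported 23% and 31% are relative reductions in on-target yield and that on-target reads per sample scale linearly with the share of a sequencing run; the baseline of 96 samples per run and the function name `samples_per_run` are hypothetical.

```python
import math


def samples_per_run(baseline_samples, efficiency_loss):
    """Samples per run that keep on-target reads per sample at the baseline level.

    efficiency_loss is the relative drop in capture efficiency (0.23 for 23%).
    """
    if not 0 <= efficiency_loss < 1:
        raise ValueError("efficiency_loss must be in [0, 1)")
    # Each sample needs 1 / (1 - loss) times more raw reads, so fewer fit per run.
    return math.floor(baseline_samples * (1 - efficiency_loss))


# Hypothetical baseline of 96 post-capture samples per sequencing run.
pool_of_12 = samples_per_run(96, 0.23)  # 73
pool_of_16 = samples_per_run(96, 0.31)  # 66
```

Under these assumptions, a run that carried 96 post-capture samples would carry roughly 73 or 66 pre-capture samples at the same on-target depth.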
8.  Accurate, rapid and high-throughput detection of strain-specific polymorphisms in Bacillus anthracis and Yersinia pestis by next-generation sequencing 
In the event of biocrimes or infectious disease outbreaks, high-resolution genetic characterization for identifying the agent and attributing it to a specific source can be crucial for an effective response. Until recently, in-depth genetic characterization required expensive and time-consuming Sanger sequencing of a few strains, followed by genotyping of a small number of marker loci in a panel of isolates at or by gel-based approaches such as pulsed field gel electrophoresis, which by necessity ignores most of the genome. Next-generation, massively parallel sequencing (MPS) technology (specifically the Applied Biosystems sequencing by oligonucleotide ligation and detection (SOLiD™) system) is a powerful investigative tool for rapid, cost-effective and parallel microbial whole-genome characterization.
To demonstrate the utility of MPS for whole-genome typing of monomorphic pathogens, four Bacillus anthracis and four Yersinia pestis strains were sequenced in parallel. Reads were aligned to complete reference genomes, and genomic variations were identified. Resequencing of the B. anthracis Ames ancestor strain detected no false-positive single-nucleotide polymorphisms (SNPs), and mapping of reads to the Sterne strain correctly identified 98% of the 133 SNPs that are not clustered or associated with repeats. Three geographically distinct B. anthracis strains from the A branch lineage were found to have between 352 and 471 SNPs each, relative to the Ames genome, and one strain harbored a genomic amplification. Sequencing of four Y. pestis strains from the Orientalis lineage identified between 20 and 54 SNPs per strain relative to the CO92 genome, with the single Bolivian isolate having approximately twice as many SNPs as the three more closely related North American strains. Coverage plotting also revealed a common deletion in two strains and an amplification in the Bolivian strain that appear to be due to insertion element-mediated recombination events. Most private SNPs (that is, variants found in only one strain in this set) selected for validation by Sanger sequencing were confirmed, although rare false-positive SNPs were associated with variable nucleotide tandem repeats.
The high-throughput, multiplexing capability, and accuracy of this system make it suitable for rapid whole-genome typing of microbial pathogens during a forensic or epidemiological investigation. By interrogating nearly every base of the genome, rare polymorphisms can be reliably discovered, thus facilitating high-resolution strain tracking and strengthening forensic attribution.
PMCID: PMC2988479  PMID: 21092340
9.  Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing 
BMC Genomics  2009;10:37.
Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs.
We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high quality data that was indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric targets correspond to highly expressed genes in rich media. The presence of Cse4 non-centromeric binding sites was not reported previously.
We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to further decrease costs per sample and to accelerate the completion of large consortium projects such as modENCODE.
PMCID: PMC2656530  PMID: 19159457
10.  Reducing amplification artifacts in high multiplex amplicon sequencing by using molecular barcodes 
BMC Genomics  2015;16(1):589.
PCR amplicon sequencing has been widely used as a targeted approach for both DNA and RNA sequence analysis. High multiplex PCR has further enabled the enrichment of hundreds of amplicons in one simple reaction. At the same time, the performance of PCR amplicon sequencing can be negatively affected by issues such as high duplicate reads, polymerase artifacts and PCR amplification bias. Recently researchers have made some good progress in addressing these shortcomings by incorporating molecular barcodes into PCR primer design. So far, most work has been demonstrated using one to a few pairs of primers, which limits the size of the region one can analyze.
We developed a simple protocol, which enables the use of molecular barcodes in high multiplex PCR with hundreds of amplicons. Using this protocol and reference materials, we demonstrated the applications in accurate variant calling at very low fraction over a large region and in targeted RNA quantification. We also evaluated the protocol’s utility in profiling FFPE samples.
We demonstrated the successful implementation of molecular barcodes in high multiplex PCR, with multiplex scale many times higher than earlier work. We showed that the new protocol combines the benefits of both high multiplex PCR and molecular barcodes, i.e. the analysis of a very large region, low DNA input requirement, very good reproducibility and the ability to detect as low as 1 % mutations with minimal false positives (FP).
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1806-8) contains supplementary material, which is available to authorized users.
PMCID: PMC4528782  PMID: 26248467
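The abstract above relies on molecular barcodes to suppress duplicate reads and polymerase artifacts. One common way to use such barcodes is to group reads that share a barcode and take a per-position majority vote; the sketch below shows that idea only and is not the protocol or software of the paper. The function name `consensus_by_barcode` and the toy reads are assumptions.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads):
    """Collapse reads sharing a molecular barcode into one majority-vote sequence.

    reads: iterable of (molecular_barcode, sequence) pairs.
    Returns a dict mapping each barcode to its consensus sequence,
    truncated to the shortest read in the group.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        length = min(len(s) for s in seqs)
        # For each position, keep the base seen most often in the group.
        consensus[barcode] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


# Hypothetical reads: three copies of one molecule (one carries an
# amplification error at position 3) and a single read from a second molecule.
reads = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),
    ("CCG", "ACTT"),
]
collapsed = consensus_by_barcode(reads)
```

In this example the isolated `T` in the first group is outvoted, while the same base in the second group survives because it is the only evidence for that molecule; a production pipeline would typically also require a minimum number of reads per barcode.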
11.  Illumina sequencing of 15 deafness genes using fragmented amplicons 
BMC Research Notes  2014;7:509.
Resequencing of deafness related genes using GS FLX massive parallel sequencing of PCR amplicons spanning selected genes has previously been reported as a successful strategy to discover causal variants. The amplicon lengths were designed to be smaller than the sequencing read length of GS FLX technology, but are longer than Illumina sequencing technology read lengths. Fragmentation is thus required to sequence these amplicons using high throughput Illumina technology.
We performed Illumina sequencing in 4 patients on 563 multiplexed amplicons covering the exons of 15 genes involved in the hearing process. After exploring several fragmentation strategies, the amplicons were fragmented using Covaris sonication prior to library preparation. CLC genomic workbench was used to analyze the data.
We achieved excellent coverage, with more than 99% of the amplicon bases covered. All variants that were previously validated using Sanger sequencing were also called in this study. Variant calling revealed fewer false positive and false negative results than in the previous study. For each patient, several variants were found that are reported by ClinVar as possible hearing loss variants.
Migration from GS FLX amplicon sequencing to Illumina amplicon sequencing is straightforward and leads to more accurate results.
Electronic supplementary material
The online version of this article (doi:10.1186/1756-0500-7-509) contains supplementary material, which is available to authorized users.
PMCID: PMC4266979  PMID: 25106482
12.  Next generation sequence analysis for mitochondrial disorders 
Genome Medicine  2009;1(10):100.
Mitochondrial disorders can originate from mutations in one of many nuclear genes controlling the organelle function or in the mitochondrial genome (mitochondrial DNA (mtDNA)). The large numbers of potential culprit genes, together with the little guidance offered by most clinical phenotypes as to which gene may be causative, are a great challenge for the molecular diagnosis of these disorders.
We developed a novel targeted resequencing assay for mitochondrial disorders relying on microarray-based hybrid capture coupled to next-generation sequencing. Specifically, we subjected the entire mtDNA genome and the exons and intron-exon boundary regions of 362 known or candidate causative nuclear genes to targeted capture and resequencing. We here provide proof-of-concept data by testing one HapMap DNA sample and two positive control samples.
Over 94% of the targeted regions were captured and sequenced with appropriate coverage and quality, allowing reliable variant calling. Pathogenic mutations blindly tested in patients' samples were 100% concordant with previous Sanger sequencing results: a known mutation in Pyruvate dehydrogenase alpha 1 subunit (PDHA1), a novel splicing and a known coding mutation in Hydroxyacyl-CoA dehydrogenase alpha subunit (HADHA) were correctly identified. Of the additional variants recognized, 90 to 94% were present in dbSNP while 6 to 10% represented new alterations. The novel nonsynonymous variants were all in heterozygote state and mostly predicted to be benign. The depth of sequencing coverage of mtDNA was extremely high, suggesting that it may be feasible to detect pathogenic mtDNA mutations confounded by low level heteroplasmy. Only one sequencing lane of an eight lane flow cell was utilized for each sample, indicating that a cost-effective clinical test can be achieved.
Our study indicates that the use of next generation sequencing technology holds great promise as a tool for screening mitochondrial disorders. The availability of a comprehensive molecular diagnostic tool will increase the capacity for early and rapid identification of mitochondrial disorders. In addition, the proposed approach has the potential to identify new mutations in candidate genes, expanding and redefining the spectrum of causative genes responsible for mitochondrial disorders.
PMCID: PMC2784303  PMID: 19852779
13.  How well do HapMap SNPs capture the untyped SNPs? 
BMC Genomics  2006;7:238.
Recent advances in human genome sequencing and genotyping have revealed millions of single nucleotide polymorphisms (SNPs) that determine the variation among human beings. One particularly important project is the International HapMap Project, which provides a catalogue of human genetic variation for disease association studies. In this paper, we analyzed the genotype data in the HapMap project by using National Institute of Environmental Health Sciences Environmental Genome Project (NIEHS EGP) SNPs. We first determine whether the HapMap data are transferable to the NIEHS data. Then, we study how well the HapMap SNPs capture the untyped SNPs in the region. Finally, we provide general guidelines for determining whether the SNPs chosen from HapMap may be able to capture most of the untyped SNPs.
Our analysis shows that HapMap data are not robust enough to capture the untyped variants for most human genes. The performance of SNPs for European and Asian samples is marginal in capturing the untyped variants, i.e. approximately 55%. As expected, the SNPs from the HapMap YRI panel can only capture approximately 30% of the variants. Although the overall performance is low, the SNPs for some genes perform very well and are able to capture most of the variants along the gene. This is observed in the European and Asian panels, but not in the African panel. Through observation, we concluded that in order to have a well-covered SNP reference panel, the SNP density and the association among reference SNPs are important to estimate the robustness of the chosen SNPs.
We have analyzed the coverage of HapMap SNPs using NIEHS EGP data. The results show that HapMap SNPs are transferable to the NIEHS SNPs. However, HapMap SNPs cannot capture some of the untyped SNPs and therefore resequencing may be needed to uncover more SNPs in the missing region.
PMCID: PMC1586200  PMID: 16982009
14.  Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing 
Human Molecular Genetics  2009;19(1):122-134.
Many disease-associated variants identified by genome-wide association (GWA) studies are expected to regulate gene expression. Allele-specific expression (ASE) quantifies transcription from both haplotypes using individuals heterozygous at tested SNPs. We performed deep human transcriptome-wide resequencing (RNA-seq) for ASE analysis and expression quantitative trait locus discovery. We resequenced double poly(A)-selected RNA from primary CD4+ T cells (n = 4 individuals, both activated and untreated conditions) and developed tools for paired-end RNA-seq alignment and ASE analysis. We generated an average of 20 million uniquely mapping 45 base reads per sample. We obtained sufficient read depth to test 1371 unique transcripts for ASE. Multiple biases inflate the false discovery rate which we estimate to be ∼50% for random SNPs. However, after controlling for these biases and considering the subset of SNPs that pass HapMap QC, 4.6% of heterozygous SNP-sample pairs show evidence of imbalance (P < 0.001). We validated four findings by both bacterial cloning and Sanger sequencing assays. We also found convincing evidence for allelic imbalance at multiple reporter exonic SNPs in CD6 for two samples heterozygous at the multiple sclerosis-associated variant rs17824933, linking GWA findings with variation in gene expression. Finally, we show in CD4+ T cells from a further individual that high-throughput sequencing of genomic DNA and RNA-seq following enrichment for targeted gene sequences by sequence capture methods offers an unbiased means to increase the read depth for transcripts of interest, and therefore a method to investigate the regulatory role of many disease-associated genetic variants.
PMCID: PMC2792152  PMID: 19825846
15.  Identification of genetic risk variants for deep vein thrombosis by multiplexed next-generation sequencing of 186 hemostatic/pro-inflammatory genes 
Next-generation DNA sequencing is opening new avenues for genetic association studies in common diseases that, like deep vein thrombosis (DVT), have a strong genetic predisposition still largely unexplained by currently identified risk variants. In order to develop sequencing and analytical pipelines for the application of next-generation sequencing to complex diseases, we conducted a pilot study sequencing the coding area of 186 hemostatic/proinflammatory genes in 10 Italian cases of idiopathic DVT and 12 healthy controls.
A molecular-barcoding strategy was used to multiplex DNA target capture and sequencing, while retaining individual sequence information. Genomic libraries with barcode sequence-tags were pooled (in pools of 8 or 16 samples) and enriched for target DNA sequences. Sequencing was performed on ABI SOLiD-4 platforms. We produced > 12 gigabases of raw sequence data to sequence at high coverage (average: 42X) the 700-kilobase target area in 22 individuals. A total of 1876 high-quality genetic variants were identified (1778 single nucleotide substitutions and 98 insertions/deletions). Annotation on databases of genetic variation and human disease mutations revealed several novel, potentially deleterious mutations. We tested 576 common variants in a case-control association analysis, carrying the top-5 associations over to replication in up to 719 DVT cases and 719 controls. We also conducted an analysis of the burden of nonsynonymous variants in coagulation factor and anticoagulant genes. We found an excess of rare missense mutations in anticoagulant genes in DVT cases compared to controls and an association for a missense polymorphism of FGA (rs6050; p = 1.9 × 10⁻⁵, OR 1.45; 95% CI, 1.22-1.72; after replication in > 1400 individuals).
We implemented a barcode-based strategy to efficiently multiplex sequencing of hundreds of candidate genes in several individuals. In the relatively small dataset of our pilot study we were able to identify bona fide associations with DVT. Our study illustrates the potential of next-generation sequencing for the discovery of genetic variation predisposing to complex diseases.
PMCID: PMC3305575  PMID: 22353194
Deep vein thrombosis; venous thromboembolism; next-generation sequencing; target capture; multiplexing; FGA; rs6025; heamostateome; DVT; VTE
16.  Semiconductor Sequencing of Human Exomes on the Ion Proton System 
Rapid, accurate, and inexpensive sequencing of exomes is critical to understand DNA variation in human disease. Ion Torrent has developed a benchtop research semiconductor sequencer, the Ion Proton™, that uses a novel CMOS chip with 165 million 1.3 µm-diameter microwells, automatically templated sub-micron particles, and integrated hardware and software that enables acquisition of ∼5 billion data points per second over a 2-4 hour runtime with on-instrument signal processing.
To illustrate the speed, accuracy, and ease-of-use of the Proton system, analysis of a HapMap familial trio of exomes will be presented. Exome libraries are obtained with high-specificity hybridization probes targeting ∼50 Mb of human exons that span 21,700 annotated protein-coding genes, microRNA, key non-coding RNA genes, and 44,000 predicted microRNA binding sites. Exome reads map on-target 75-83% between runs and 10.6 Gb of aligned data, obtained from a single P1 chip, yielded 141X average depth with 30X coverage of 90% of targeted bases. Read mapping, coverage analysis, variant calling and annotation are done with Torrent Suite and Ion Reporter™ software. Each trio dataset yielded ∼30,000 SNP calls from single runs that exceeded 9 Gb of aligned data. The observed Het:Hom ratio of 1.4-1.5 matches the published range of 1.25-1.7 for European ethnicity and the observed Ts:Tv ratio of 2.9 agrees well with the published range of 2.8-3.1 for human exomes. The SNP concordance with dbSNP137 is greater than 98% and Het and Hom concordances with Complete Genomics data are 98% and 96%, respectively. Mendelian inheritance analysis indicates that error for Hets is 0.6% with no errors for homozygotic SNPs. The Proton system delivers high-quality individual exome datasets rapidly and can be used for trio analysis to detect shared germline SNPs with high confidence.
The Ion Proton™ System is for research use only and not for use in diagnostic procedures.
PMCID: PMC3635417
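The Ts:Tv quality metric quoted in the abstract above can be computed from a list of SNP calls. The following is a minimal sketch, assuming each call is a (reference base, alternate base) pair of single nucleotides; the function names and the example calls are hypothetical, not taken from Torrent Suite or Ion Reporter.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snps) -> float:
    """Ratio of transitions to transversions over (ref, alt) SNP pairs.

    Assumes every pair is a true single-base substitution (ref != alt).
    """
    transitions = sum(is_transition(ref, alt) for ref, alt in snps)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    # Illustrative calls only: three transitions and one transversion.
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(ts_tv_ratio(calls))
```

The Het:Hom ratio reported alongside it is a plain count ratio of heterozygous to homozygous variant genotypes and needs no base classification.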
17.  A comprehensive resequence analysis of the KLK15–KLK3–KLK2 locus on chromosome 19q13.33 
Human Genetics  2009;127(1):91-99.
Single nucleotide polymorphisms (SNPs) in the KLK3 gene on chromosome 19q13.33 are associated with serum prostate-specific antigen (PSA) levels. Recent genome wide association studies of prostate cancer have yielded conflicting results for association of the same SNPs with prostate cancer risk. Since the KLK3 gene encodes the PSA protein that forms the basis for a widely used screening test for prostate cancer, it is critical to fully characterize genetic variation in this region and assess its relationship with the risk of prostate cancer. We have conducted a next-generation sequence analysis in 78 individuals of European ancestry to characterize common (minor allele frequency, MAF >1%) genetic variation in a 56 kb region on chromosome 19q13.33 centered on the KLK3 gene (chr19:56,019,829–56,076,043 bps). We identified 555 polymorphic loci in the process including 116 novel SNPs and 182 novel insertion/deletion polymorphisms (indels). Based on tagging analysis, 144 loci are necessary to tag the region at an r² threshold of 0.8 and MAF of 1% or higher, while 86 loci are required to tag the region at an r² threshold of 0.8 and MAF >5%. Our sequence data augments coverage by 35 and 78% as compared to variants in dbSNP and HapMap, respectively. We observed six non-synonymous amino acid or frame shift changes in the KLK3 gene and three changes in each of the neighboring genes, KLK15 and KLK2. Our study has generated a detailed map of common genetic variation in the genomic region surrounding the KLK3 gene, which should be useful for fine-mapping the association signal as well as determining the contribution of this locus to prostate cancer risk and/or regulation of PSA expression.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-009-0751-5) contains supplementary material, which is available to authorized users.
PMCID: PMC2793378  PMID: 19823874
18.  Accurate and exact CNV identification from targeted high-throughput sequence data 
BMC Genomics  2011;12:184.
Massively parallel sequencing of barcoded DNA samples significantly increases screening efficiency for clinically important genes. Short read aligners are well suited to single nucleotide and indel detection. However, methods for CNV detection from targeted enrichment are lacking. We present a method combining coverage with map information for the identification of deletions and duplications in targeted sequence data.
Sequencing data is first scanned for gains and losses using a comparison of normalized coverage data between samples. CNV calls are confirmed by testing for a signature of sequences that span the CNV breakpoint. With our method, CNVs can be identified regardless of whether breakpoints are within regions targeted for sequencing. For CNVs where at least one breakpoint is within targeted sequence, exact CNV breakpoints can be identified. In a test data set of 96 subjects sequenced across ~1 Mb genomic sequence using multiplexing technology, our method detected mutations as small as 31 bp, predicted quantitative copy count, and had a low false-positive rate.
Application of this method allows for identification of gains and losses in targeted sequence data, providing comprehensive mutation screening when combined with a short read aligner.
PMCID: PMC3088570  PMID: 21486468
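The first stage described in the abstract above, scanning for gains and losses by comparing normalized coverage between samples, can be illustrated with a small sketch. It assumes per-target read depths are already available as lists. The function names and the 0.75/1.25 ratio thresholds are hypothetical choices made for illustration, not the authors' parameters, and the breakpoint-spanning confirmation step is not covered.

```python
from statistics import median


def normalize(depths):
    """Scale per-target depths so they sum to 1 within a sample."""
    total = sum(depths)
    if total == 0:
        raise ValueError("sample has no coverage")
    return [d / total for d in depths]


def call_cnv(sample, references, loss=0.75, gain=1.25):
    """Label each target as 'loss', 'gain', 'neutral' or 'no-call'.

    sample: per-target depths for the test sample.
    references: list of per-target depth lists for comparison samples.
    loss / gain: assumed ratio thresholds (illustrative values).
    """
    test = normalize(sample)
    refs = [normalize(r) for r in references]
    calls = []
    for i, value in enumerate(test):
        # Baseline is the median normalized depth across reference samples.
        baseline = median(r[i] for r in refs)
        if baseline == 0:
            calls.append("no-call")
            continue
        ratio = value / baseline
        if ratio < loss:
            calls.append("loss")
        elif ratio > gain:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls


if __name__ == "__main__":
    # Illustrative depths: the second target has half the expected coverage.
    refs = [[100, 100, 100, 100]] * 3
    print(call_cnv([100, 50, 100, 100], refs))
```

Normalizing within each sample keeps differences in total sequencing yield between barcoded samples from being mistaken for copy-number changes.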
19.  Population-based rare variant detection via pooled exome or custom hybridization capture with or without individual indexing 
BMC Genomics  2012;13:683.
Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing of many individuals across many genomic regions and comparing against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost prohibitive for most investigators. In addition, DNA samples are often precious and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost and time-efficient strategy for surveying populations of individuals for rare variants. We set out to 1) create a scalable, multiplexing method for custom capture with or without individual DNA indexing that was amenable to low amounts of input DNA and 2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner, Novoalign.
We report methodology for pooled hybridization capture with pre-enrichment, indexed multiplexing of up to 48 individuals or non-indexed pooled sequencing of up to 92 individuals with as little as 70 ng of DNA per person. Modified solid phase reversible immobilization bead purification strategies enable no sample transfers from sonication in 96-well plates through adapter ligation, resulting in 50% less library preparation reagent consumption. Custom Y-shaped adapters containing novel 7 base pair index sequences with a Hamming distance of ≥2 were directly ligated onto fragmented source DNA, eliminating the need for PCR to incorporate indexes; ligation was followed by a custom blocking strategy using a single oligonucleotide regardless of index sequence. These results were obtained by aligning raw reads against the entire genome using Novoalign, followed by variant calling with SPLINTER for non-indexed pools or SAMtools for indexed samples. With these pipelines, we find sensitivity and specificity of 99.4% and 99.7% for pooled exome sequencing. Sensitivity, and to a lesser degree specificity, proved to be a function of coverage. For rare variants (≤2% minor allele frequency), we achieved sensitivity and specificity of ≥94.9% and ≥99.99% for custom capture of 2.5 Mb in multiplexed libraries of 22–48 individuals with only ≥5-fold coverage/chromosome, but these parameters improved to ≥98.7% and 100% with 20-fold coverage/chromosome.
This highly scalable methodology enables accurate rare variant detection, with or without individual DNA sample indexing, while reducing the amount of required source DNA and total costs through less hybridization reagent consumption, multi-sample sonication in a standard PCR plate, multiplexed pre-enrichment pooling with a single hybridization and lesser sequencing coverage required to obtain high sensitivity.
PMCID: PMC3534616  PMID: 23216810
Rare variants; Genomics; Exome; Hybridization capture; Multiplexed capture; Indexed capture; SPLINTER
20.  A pyrosequencing-tailored nucleotide barcode design unveils opportunities for large-scale sample multiplexing 
Nucleic Acids Research  2007;35(19):e130.
Multiplexed high-throughput pyrosequencing is currently limited in complexity (number of samples sequenced in parallel), and in capacity (number of sequences obtained per sample). Physical-space segregation of the sequencing platform into a fixed number of channels allows limited multiplexing, but obscures available sequencing space. To overcome these limitations, we have devised a novel barcoding approach to allow for pooling and sequencing of DNA from independent samples, and to facilitate subsequent segregation of sequencing capacity. Forty-eight forward–reverse barcode pairs are described: each forward and each reverse barcode unique with respect to at least 4 nt positions. With improved read lengths of pyrosequencers, combinations of forward and reverse barcodes may be used to sequence from as many as n² independent libraries for each set of ‘n’ forward and ‘n’ reverse barcodes, for each defined set of cloning-linkers. In two pilot series of barcoded sequencing using the GS20 Sequencer (454/Roche), we found that over 99.8% of obtained sequences could be assigned to 25 independent, uniquely barcoded libraries based on the presence of either a perfect forward or a perfect reverse barcode. The false-discovery rate, as measured by the percentage of sequences with unexpected perfect pairings of unmatched forward and reverse barcodes, was estimated to be <0.005%.
PMCID: PMC2095802  PMID: 17932070
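The design criterion in the abstract above, barcodes that differ from one another in at least 4 nt positions, amounts to a minimum pairwise Hamming distance over the barcode set. A minimal sketch of that check follows, together with the n × n count of forward/reverse combinations. The function names and the example sequences are hypothetical and are not the published barcodes.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance over all unordered barcode pairs."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def max_libraries(n_forward: int, n_reverse: int) -> int:
    """Distinct forward/reverse combinations (n squared when both are n)."""
    return n_forward * n_reverse


if __name__ == "__main__":
    # Illustrative 8-nt barcodes only.
    example = ["ACGTACGT", "TGCATGCA", "ACGTTGCA", "TGCAACGT"]
    print(min_pairwise_distance(example) >= 4)
    print(max_libraries(48, 48))
```

A larger minimum distance means more sequencing errors are needed before one barcode can be misread as another, at the cost of fewer usable barcodes of a given length.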
21.  A Balanced Barcoding System for Multiplexed DNA library and SOLiD SAGE “Sequencing” 
A set of 96 molecular barcode adaptors specifically designed for the SOLiD™ platform have been validated for use with DNA fragment and paired end libraries. Moreover, the barcode system is adapted for multiplexed Serial Analysis of Gene Expression (SAGE). DNA libraries are constructed with a multiplex adaptor which consists of three segments: (1) an internal sequencing primer binding site, (2) a barcode decamer sequence and (3) a P2 PCR priming site. The barcode and target DNA are then sequenced as two separate reads from the same strand allowing for the libraries to be pooled in a multiplexed emulsion PCR and deposited into a single spot on a SOLiD™ slide. Similarly, SAGE libraries are constructed with a modified adaptor allowing for the addition of unique barcode primers with a short cycle amplification consistent with the SOLiD™ barcoding system. The modular barcoding design requires only 5bp of sequencing to distinguish 16-plex samples and 10bp of sequencing to distinguish 96-plex samples. The barcodes are optimized in sets of four wherein each set is color balanced at every position. Importantly, clear discrimination between barcode samples is achieved by maintaining a minimum Hamming distance of 3 colorspace calls for optimal data integrity. The DNA barcode system was validated by sequencing of E. coli fragment libraries. Error rates and quality value (QV) scores for the barcode reads were found to be consistent across the final set. Importantly, QV scores were also consistent for the reads, indicating minimal effects of the barcode decamers on bead templating and ligation sequencing efficiency. Furthermore, the set of 16 SAGE barcoded samples yielded Pearson correlations above 0.98. Ongoing development studies include integration with methods of target enrichment that will further enable high levels of DNA and RNA expression library multiplexing afforded by the increasing throughput of the SOLiD™ system.
PMCID: PMC2918128
22.  Management of High-Throughput DNA Sequencing Projects: Alpheus 
High-throughput DNA sequencing has enabled systems biology to begin to address areas in health, agricultural and basic biological research. Concomitant with the opportunities is an absolute necessity to manage significant volumes of high-dimensional and inter-related data and analysis. Alpheus is an analysis pipeline, database and visualization software for use with massively parallel DNA sequencing technologies that feature multi-gigabase throughput characterized by relatively short reads, such as Illumina-Solexa (sequencing-by-synthesis), Roche-454 (pyrosequencing) and Applied Biosystem’s SOLiD (sequencing-by-ligation). Alpheus enables alignment to reference sequence(s), detection of variants and enumeration of sequence abundance, including expression levels in transcriptome sequence. Alpheus is able to detect several types of variants, including non-synonymous and synonymous single nucleotide polymorphisms (SNPs), insertions/deletions (indels), premature stop codons, and splice isoforms. Variant detection is aided by the ability to filter variant calls based on consistency, expected allele frequency, sequence quality, coverage, and variant type in order to minimize false positives while maximizing the identification of true positives. Alpheus also enables comparisons of genes with variants between cases and controls or bulk segregant pools. Sequence-based differential expression comparisons can be developed, with data export to SAS JMP Genomics for statistical analysis.
PMCID: PMC2819532  PMID: 20151039
Alpheus; sequencing-by-synthesis; pyrosequencing; GMAP; GSNAP; resequencing; transcriptome sequencing
23.  Purifying Selection in Deeply Conserved Human Enhancers Is More Consistent than in Coding Sequences 
PLoS ONE  2014;9(7):e103357.
Comparison of polymorphism at synonymous and non-synonymous sites in protein-coding DNA can provide evidence for selective constraint. Non-coding DNA that forms part of the regulatory landscape presents more of a challenge since there is not such a clear-cut distinction between sites under stronger and weaker selective constraint. Here, we consider putative regulatory elements termed Conserved Non-coding Elements (CNEs) defined by their high level of sequence identity across all vertebrates. Some mutations in these regions have been implicated in developmental disorders; we analyse CNE polymorphism data to investigate whether such deleterious effects are widespread in humans. Single nucleotide variants from the HapMap and 1000 Genomes Projects were mapped across nearly 2000 CNEs. In the 1000 Genomes data we find a significant excess of rare derived alleles in CNEs relative to coding sequences; this pattern is absent in HapMap data, apparently obscured by ascertainment bias. The distribution of polymorphism within CNEs is not uniform; we could identify two categories of sites by exploiting deep vertebrate alignments: stretches that are non-variant, and those that have at least one substitution. The conserved category has fewer polymorphic sites and a greater excess of rare derived alleles, which can be explained by a large proportion of sites under strong purifying selection within humans – higher than that for non-synonymous sites in most protein coding regions, and comparable to that at the strongly conserved trans-dev genes. Conversely, the more evolutionarily labile CNE sites have an allele frequency distribution not significantly different from non-synonymous sites. Future studies should exploit genome-wide re-sequencing to obtain better coverage in selected non-coding regions, given the likelihood that mutations in evolutionarily conserved enhancer sequences are deleterious. 
Discovery pipelines should validate non-coding variants to aid in identifying causal and risk-enhancing variants in complex disorders, in contrast to the current focus on exome sequencing.
PMCID: PMC4111549  PMID: 25062004
24.  The diploid genome sequence of an Asian individual 
Nature  2008;456(7218):60-65.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
PMCID: PMC2716080  PMID: 18987735
25.  Efficient and Cost Effective Population Resequencing by Pooling and In-Solution Hybridization 
PLoS ONE  2011;6(3):e18353.
High-throughput sequencing of targeted genomic loci in large populations is an effective approach for evaluating the contribution of rare variants to disease risk. We evaluated the feasibility of using in-solution hybridization-based target capture on pooled DNA samples to enable cost-efficient population sequencing studies. For this, we performed pooled sequencing of 100 HapMap samples across ∼600 kb of DNA sequence using the Illumina GAIIx. Using our accurate variant calling method for pooled sequence data, we were able to not only identify single nucleotide variants with a low false discovery rate (<1%) but also accurately detect short insertion/deletion variants. In addition, with sufficient coverage per individual in each pool (30-fold) we detected 97.2% of the total variants and 93.6% of variants below 5% in frequency. Finally, allele frequencies for single nucleotide variants (SNVs) estimated from the pooled data and the HapMap genotype data were tightly correlated (correlation coefficient ≥ 0.995).
PMCID: PMC3068187  PMID: 21479135
