Usually, next generation sequencing (NGS) technology has the property of ultra-high throughput but the read length is remarkably short compared to conventional Sanger sequencing. Paired-end NGS could computationally extend the read length but with a lot of practical inconvenience because of the inherent gaps. Now that Illumina paired-end sequencing has the ability of read both ends from 600 bp or even 800 bp DNA fragments, how to fill in the gaps between paired ends to produce accurate long reads is intriguing but challenging.
We have developed a new technology, referred to as pseudo-Sanger (PS) sequencing. It tries to fill in the gaps between paired ends and could generate near error-free sequences equivalent to the conventional Sanger reads in length but with the high throughput of the Next Generation Sequencing. The major novelty of PS method lies on that the gap filling is based on local assembly of paired-end reads which have overlaps with at either end. Thus, we are able to fill in the gaps in repetitive genomic region correctly. The PS sequencing starts with short reads from NGS platforms, using a series of paired-end libraries of stepwise decreasing insert sizes. A computational method is introduced to transform these special paired-end reads into long and near error-free PS sequences, which correspond in length to those with the largest insert sizes. The PS construction has 3 advantages over untransformed reads: gap filling, error correction and heterozygote tolerance. Among the many applications of the PS construction is de novo genome assembly, which we tested in this study. Assembly of PS reads from a non-isogenic strain of Drosophila melanogaster yields an N50 contig of 190 kb, a 5 fold improvement over the existing de novo assembly methods and a 3 fold advantage over the assembly of long reads from 454 sequencing.
Our method generated near error-free long reads from NGS paired-end sequencing. We demonstrated that de novo assembly could benefit a lot from these Sanger-like reads. Besides, the characteristic of the long reads could be applied to such applications as structural variations detection and metagenomics.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-14-711) contains supplementary material, which is available to authorized users.
Next-generation sequencing; Gap filling; Genome assembly
Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.
We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper-bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off -the-shelf assembler in the presence of base-call errors.
Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.
Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer ‘super-reads’. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced ‘mazurka’).
Results: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads.
Availability: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year.
Supplementary data are available at Bioinformatics online.
Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly which plays significant roles in many applications remains a great challenge. Most of the state-of-the-art approaches base on de Bruijn graph strategy and overlap-layout strategy. However, these approaches which depend on k-mers or read overlaps do not fully utilize information of paired-end and single-end reads when resolving branches. Since they treat all single-end reads with overlapped length larger than a fix threshold equally, they fail to use the more confident long overlapped reads for assembling and mix up with the relative short overlapped reads. Moreover, these approaches have not been special designed for handling tandem repeats (repeats occur adjacently in the genome) and they usually break down the contigs near the tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts greedy-like prediction strategy for assembling reads to contigs and scaffolds using paired-end reads and different read overlap size ranging from Omax to Omin to resolve the gaps and branches. By constructing a decision model using machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contig by all feasible extensions and determine the correct extension by using look-ahead approach. Many difficult-resolved branches are due to tandem repeats which are close in the genome. PERGA detects such different copies of the repeats to resolve the branches to make the extension much longer and more accurate. We evaluated PERGA on both Illumina real and simulated datasets ranging from small bacterial genomes to large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high complexity datasets, where in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve the sequencing efficiency and cost. On the other hand, they result in shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels.
In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm could handle very short reads and sequencing errors. It is initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be download for free at
Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with a high precision and sensitivity.
Metagenomics; NGS short reads; Genome separation; Clustering
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.
Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths.
Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR.
Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.
Next Generation Sequencing technologies are able to provide high genome coverages at a relatively low cost. However, due to limited reads' length (from 30 bp up to 200 bp), specific bioinformatics problems have become even more difficult to solve. De novo assembly with short reads, for example, is more complicated at least for two reasons: first, the overall amount of "noisy" data to cope with increased and, second, as the reads' length decreases the number of unsolvable repeats grows. Our work's aim is to go at the root of the problem by providing a pre-processing tool capable to produce (in-silico) longer and highly accurate sequences from a collection of Next Generation Sequencing reads.
In this paper a seed-and-extend local assembler is presented. The kernel algorithm is a loop that, starting from a read used as seed, keeps extending it using heuristics whose main goal is to produce a collection of error-free and longer sequences. In particular, GapFiller carefully detects reliable overlaps and operates clustering similar reads in order to reconstruct the missing part between the two ends of the same insert. Our tool's output has been validated on 24 experiments using both simulated and real paired reads datasets. The output sequences are declared correct when the seed-mate is found. In the experiments performed, GapFiller was able to extend high percentages of the processed seeds and find their mates, with a false positives rate that turned out to be nearly negligible.
GapFiller, starting from a sufficiently high short reads coverage, is able to produce high coverages of accurate longer sequences (from 300 bp up to 3500 bp). The procedure to perform safe extensions, together with the mate-found check, turned out to be a powerful criterion to guarantee contigs' correctness. GapFiller has further potential, as it could be applied in a number of different scenarios, including the post-processing validation of insertions/deletions detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or in any hierarchical approach designed to assemble, analyse or validate pools of sequences.
de novo assembly; paired reads; (Hamming-aware) hash functions; Next Generation Sequencing data
Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.
Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
Supplementary information: Supplementary data are available at Bioinformatics online.
Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT.
We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses.
Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly.
Microbiome; Metatranscriptomics; Sequence assembly; Bioinformatics; RNA sequencing
Motivation: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.
Results: In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt and is fastest in detecting complex variants with four or more mismatches or insertions of 1–9 nt and deletions of 1–30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7–8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.
Availability: Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.
The assembly of transcriptomes is a difficult challenge due to the complexity of transcriptomes which include multiple isoforms of various transcripts resulting in many highly similar sequences.When high throughput sequence reads are used, the short read lengths often are not long enough to cover an entire exon or exons with retained introns.NextGENe's novel Condensation Tool is used to cluster reads containing similar sequences to statistically correct sequencing errors.The first cycle uses 12mer anchor sequences and 12bp flanking sequences to sort reads into groups and subgroups to generate a consensus.The second cycle and any additional cycles incrementally elongate reads by merging similar contigs, allowing minimum variation.In this way, NextGENe Software's Condensation Tool overcomes many of the challenges involved with the analysis of next generation sequencing data.By clustering similar reads, data is polished, removing low frequency, biased sequencing errors while maintaining true variations.Reads are lengthened to provide a more unique sequence greatly increasing assembly accuracy. Following Condensation, overlapping paired end reads can be linked to generate single reads spanning the entire library, up to 1000bp.
Motivation: New, high-throughput sequencing technologies have made it feasible to cheaply generate vast amounts of sequence information from a genome of interest. The computational reconstruction of the complete sequence of a genome is complicated by specific features of these new sequencing technologies, such as the short length of the sequencing reads and absence of mate-pair information. In this article we propose methods to overcome such limitations by incorporating information from optical restriction maps.
Results: We demonstrate the robustness of our methods to sequencing and assembly errors using extensive experiments on simulated datasets. We then present the results obtained by applying our algorithms to data generated from two bacterial genomes Yersinia aldovae and Yersinia kristensenii. The resulting assemblies contain a single scaffold covering a large fraction of the respective genomes, suggesting that the careful use of optical maps can provide a cost-effective framework for the assembly of genomes.
Availability: The tools described here are available as an open-source package at ftp://ftp.cbcb.umd.edu/pub/software/soma
Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.
QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.
A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggest that aligning (parts of) those reads with a single gap is in fact desirable.
In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment.
We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.
The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, the SOLiD read data with lengths of <60 bp have been considered to be too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only SOLiD read data of 50 bp generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, a 22-fold increase than those obtained using only SOLiD short read in other published reports. In addition, almost 99% of the reference genome was accurately aligned by the assembled scaffold fragments in long lengths. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also highly reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effect of filtering low quality data, library insert size, and k-mer size on the assembly performance, and recommend for the assembly use of mild filtered read data where the N50 was not so degraded and the library has an insert size of ∼2.0 kb, and k-mer size 33.
Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies.
We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly.
These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.
Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs.
Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs however gave a more credible final product, and this strategy is recommended.
NextGen sequencing is a powerful and cost efficient tool for ultra-high-throughput genome and transcriptome analysis. One of the key features of next generation sequencing is de novo whole genome sequencing, but assembly and genome finishing is still a major challenge due to short reads generated by these technologies. The 2kb-5kb mate pair reads combined with Illumina short pair-end reads are used in getting better genomic coverage across the genome. The standard 2kb-5kb Illumina mate-pair library construction protocol does not allow barcoding, and has built-in limitations that prevent getting more than 36bp reads at either end, as increasing read length can lead to elevated error rate. This is due to the fact that the junction reads cannot be identified easily if working with de novo assembly or those reads got discarded, since they would not align to reference sequence. Here, we demonstrate a modified 2kb-5kb mate pair library construction protocol for Illumina technologies that allows long barcoded, mate-paired reads without increasing error rates.
With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
The successful mapping of high-throughput sequencing (HTS) reads to reference genomes largely depends on the accuracy of both the sequencing technologies and reference genomes. Current mapping algorithms focus on mapping with mismatches but largely neglect insertions and deletions—regardless of whether they are caused by sequencing errors or genomic variation. Furthermore, trailing contaminations by primers and declining read qualities can be cumbersome for programs that allow a maximum number of mismatches. We have developed and implemented a new approach for short read mapping that, in a first step, computes exact matches of the read and the reference genome. The exact matches are then modified by a limited number of mismatches, insertions and deletions. From the set of exact and inexact matches, we select those with minimum score-based E-values. This gives a set of regions in the reference genome which is aligned to the read using Myers bitvector algorithm . Our method utilizes enhanced suffix arrays  to quickly find the exact and inexact matches. It maps more reads and achieves higher recall rates than previous methods. This consistently holds for reads produced by 454 as well as Illumina sequencing technologies.
RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, error, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, we compare de novo assemblies of paired end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from genome annotation of the model plant Ricinus communis. By doing so we evaluate strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.
We found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, we found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate. In order to reduce redundancy, we investigated three methods: 1) using cap3 and CD-HIT-EST to combine highly similar transcripts, 2) only retaining the transcript with the highest read coverage, or removing the transcript with the lowest read coverage for each subcomponent in Trinity, and 3) filtering Oases single k-mer assemblies by number of transcripts per locus and relative transcript length, and then finding the transcript with the highest read coverage. We then utilized results from blastx against model protein sequences to effectively remove trans chimeras. After optimization, seven assembly strategies among all four packages successfully assembled 42.9–47.1% of reference genes to more than 200 bp, with a chimera rate of 0.92–2.21%, and on average 1.8–3.1 transcripts per reference gene assembled.
With rapidly improving sequencing and assembly tools, our study provides a framework to benchmark and optimize performance before choosing tools or parameter combinations for analyzing short-read RNA-seq data. Our study demonstrates that choice of assembly package, k-mer sizes, post-assembly redundancy-reduction and chimera cleanup, and strand-specific RNA-seq library preparation and assembly dramatically improves gene coverage by non-redundant and non-chimeric transcripts that are optimized for downstream phylogenomic analyses.
1KP; Chimera; De novo assembly; Redundancy; RNA-seq; SOAPdenovo-Trans; Trans-ABySS; Transcriptome; Trinity; Oases
Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
Results: Standard overlap assembly methods have time complexity O(N2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing has encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms, however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage which are known to impact genome assembly are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E.coli (4.6 MB) S.kudriavzevii (11.18 MB) and C.elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.
The advent of cheap high through-put sequencing methods has facilitated low coverage skims of a large number of organisms. To maximise the utility of the sequences, assembly into contigs and then ordering of those contigs is required. Whilst sequences can be assembled into contigs de novo, using assembled genomes of closely related organisms as a framework can considerably aid the process. However, the preferred search programs and parameters that will optimise the sensitivity and specificity of the alignments between the sequence reads and the framework genome(s) are not necessarily obvious. Here we demonstrate a process that uses paired-end sequence reads to choose an optimal program and alignment parameters.
Unlike two single fragment reads, in paired-end sequence reads, such as BAC-end sequences, the two sequences in the pair have a known positional relationship in the original genome. This provides an additional level of confidence over match scores and e-values in the accuracy of the positional assignment of the reads in the comparative genome. Three commonly used sequence alignment programs: MegaBLAST, Blastz and PatternHunter were used to align a set of ovine BAC-end sequences against the equine genome assembly. A range of different search parameters, with a particular focus on contiguous and discontiguous seeds, were used for each program. The number of reads with a hit and the number of read pairs with hits for the two end sequences in the tail-to-tail paired-end configuration were plotted relative to the theoretical maximum expected curve. Of the programs tested, MegaBLAST with short contiguous seed lengths (word size 8-11) performed best in this particular task. In addition the data also provides estimates of the false positive and false negative rates, which can be used to determine the appropriate values of additional parameters, such as score cut-off, to balance sensitivity and specificity. To determine whether the approach also worked for the alignment of shorter reads, the first 240 bases of each BAC end sequence were also aligned to the equine genome. Again, contiguous MegaBLAST performed the best in optimising the sensitivity and specificity with which sheep BAC end reads map to the equine and bovine genomes.
Paired-end reads, such as BAC-end sequences, provide an efficient mechanism to optimise sequence alignment parameters, for example for comparative genome assemblies, by providing an objective standard to evaluate performance.
Illumina paired-end reads are used to analyse microbial communities by targeting amplicons of the 16S rRNA gene. Publicly available tools are needed to assemble overlapping paired-end reads while correcting mismatches and uncalled bases; many errors could be corrected to obtain higher sequence yields using quality information.
PANDAseq assembles paired-end reads rapidly and with the correction of most errors. Uncertain error corrections come from reads with many low-quality bases identified by upstream processing. Benchmarks were done using real error masks on simulated data, a pure source template, and a pooled template of genomic DNA from known organisms. PANDAseq assembled reads more rapidly and with reduced error incorporation compared to alternative methods.
PANDAseq rapidly assembles sequences and scales to billions of paired-end reads. Assembly of control libraries showed a 4-50% increase in the number of assembled sequences over naïve assembly with negligible loss of "good" sequence.