Next-generation sequencing technologies have increased the amount of biological data generated. Thus, bioinformatics has become
important because new methods and algorithms are necessary to manipulate and process such data. However, certain challenges
have emerged, such as genome assembly using short reads and high-throughput platforms. In this context, several algorithms have
been developed, such as Velvet, Abyss, Euler-SR, Mira, Edna, Maq, SHRiMP, Newbler, ALLPATHS, Bowtie and BWA. However,
most such assemblers do not have a graphical interface, which makes their use difficult for users without computing experience
given the complexity of the assembler syntax. Thus, to make the operation of such assemblers accessible to users without a
computing background, we developed AutoAssemblyD, which is a graphical tool for genome assembly submission and remote
management by multiple assemblers through XML templates.
AssemblyD is freely available at https://sourceforge.net/projects/autoassemblyd. It requires Sun jdk 6 or higher.
Next-generation sequencing; Genome Assembly; Bioinformatics
An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30–90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ∼4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
Fungi have immense impacts on ecosystems and affect many aspects of society. They are used as convenient organisms for fundamental research because their typically haploid genetics enable straightforward phenotyping of mutations and because most fungal cells can differentiate the entire organism. Fungi have compact genomes with few repetitive sequences, and their genomes should be much easier to assemble from short sequence reads than genomes of mammals or higher plants. To test this idea, we used Solexa and 454 sequencing to generate ∼4 Gb of raw sequence data from the filamentous fungus Sordaria macrospora. De novo assembly yielded 5,097 contigs. This assembly was improved by comparison with reference genomes of three closely related Neurospora species, resulting in placement of ∼40 Mb of genome sequence in 152 scaffolds. From comparisons of predicted proteins we conclude that S. macrospora carries a conserved set of genes for signaling and development, which should encourage its further use as a model organism for morphogenesis and meiosis. We demonstrate that de novo assembly of fungal genomes from short reads is cheap and efficient. Species that are not traditionally considered “model organisms” but await genome sequencing for comparative and functional genomics analyses are at last amenable to in-depth genome-wide analyses.
Recent advances in next-generation sequencing technologies have drastically increased throughput and significantly reduced sequencing costs. However, the average read lengths in next-generation sequencing technologies are short as compared with that of traditional Sanger sequencing. The short sequence reads pose great challenges for de novo sequence assembly. As a pilot project for whole genome sequencing of the catfish genome, here we attempt to determine the proper sequence coverage, the proper software for assembly, and various parameters used for the assembly of a BAC physical map contig spanning approximately a million of base pairs.
A combination of low sequence coverage of 454 and Illumina sequencing appeared to provide effective assembly as reflected by a high N50 value. Using 454 sequencing alone, a sequencing depth of 18 X was sufficient to obtain the good quality assembly, whereas a 70 X Illumina appeared to be sufficient for a good quality assembly. Additional sequencing coverage after 18 X of 454 or after 70 X of Illumina sequencing does not provide significant improvement of the assembly. Considering the cost of sequencing, a 2 X 454 sequencing, when coupled to 70 X Illumina sequencing, provided an assembly of reasonably good quality. With several software tested, Newbler with a seed length of 16 and ABySS with a K-value of 60 appear to be appropriate for the assembly of 454 reads alone and Illumina paired-end reads alone, respectively. Using both 454 and Illumina paired-end reads, a hybrid assembly strategy using Newbler for initial 454 sequence assembly, Velvet for initial Illumina sequence assembly, followed by a second step assembly using MIRA provided the best assembly of the physical map contig, resulting in 193 contigs with a N50 value of 13,123 bp.
A hybrid sequencing strategy using low sequencing depth of 454 and high sequencing depth of Illumina provided the good quality assembly with high N50 value and relatively low cost. A combination of Newbler, Velvet, and MIRA can be used to assemble the 454 sequence reads and the Illumina reads effectively. The assembled sequence can serve as a resource for comparative genome analysis. Additional long reads using the third generation sequencing platforms are needed to sequence through repetitive genome regions that should further enhance the sequence assembly.
De novo genome sequencing of previously uncharacterized microorganisms has the potential to open up new frontiers in microbial genomics by providing insight into both functional capabilities and biodiversity. Until recently, Roche 454 pyrosequencing was the NGS method of choice for de novo assembly because it generates hundreds of thousands of long reads (<450 bps), which are presumed to aid in the analysis of uncharacterized genomes. The array of tools for processing NGS data are increasingly free and open source and are often adopted for both their high quality and role in promoting academic freedom.
The error rate of pyrosequencing the Alcanivorax borkumensis genome was such that thousands of insertions and deletions were artificially introduced into the finished genome. Despite a high coverage (~30 fold), it did not allow the reference genome to be fully mapped. Reads from regions with errors had low quality, low coverage, or were missing. The main defect of the reference mapping was the introduction of artificial indels into contigs through lower than 100% consensus and distracting gene calling due to artificial stop codons. No assembler was able to perform de novo assembly comparable to reference mapping. Automated annotation tools performed similarly on reference mapped and de novo draft genomes, and annotated most CDSs in the de novo assembled draft genomes.
Free and open source software (FOSS) tools for assembly and annotation of NGS data are being developed rapidly to provide accurate results with less computational effort. Usability is not high priority and these tools currently do not allow the data to be processed without manual intervention. Despite this, genome assemblers now readily assemble medium short reads into long contigs (>97-98% genome coverage). A notable gap in pyrosequencing technology is the quality of base pair calling and conflicting base pairs between single reads at the same nucleotide position. Regardless, using draft whole genomes that are not finished and remain fragmented into tens of contigs allows one to characterize unknown bacteria with modest effort.
Reference mapping; De novo sequencing; De novo assembly; Automated annotation; Marine bacteria
The main limitations in the analysis of viral metagenomes are perhaps the high genetic variability and the lack of information in extant databases. To address these issues, several bioinformatic tools have been specifically designed or adapted for metagenomics by improving read assembly and creating more sensitive methods for homology detection. This study compares the performance of different available assemblers and taxonomic annotation software using simulated viral-metagenomic data.
We simulated two 454 viral metagenomes using genomes from NCBI's RefSeq database based on the list of actual viruses found in previously published metagenomes. Three different assembly strategies, spanning six assemblers, were tested for performance: overlap-layout-consensus algorithms Newbler, Celera and Minimo; de Bruijn graphs algorithms Velvet and MetaVelvet; and read probabilistic model Genovo. The performance of the assemblies was measured by the length of resulting contigs (using N50), the percentage of reads assembled and the overall accuracy when comparing against corresponding reference genomes. Additionally, the number of chimeras per contig and the lowest common ancestor were estimated in order to assess the effect of assembling on taxonomic and functional annotation. The functional classification of the reads was evaluated by counting the reads that correctly matched the functional data previously reported for the original genomes and calculating the number of over-represented functional categories in chimeric contigs. The sensitivity and specificity of tBLASTx, PhymmBL and the k-mer frequencies were measured by accurate predictions when comparing simulated reads against the NCBI Virus genomes RefSeq database.
Assembling improves functional annotation by increasing accurate assignations and decreasing ambiguous hits between viruses and bacteria. However, the success is limited by the chimeric contigs occurring at all taxonomic levels. The assembler and its parameters should be selected based on the focus of each study. Minimo's non-chimeric contigs and Genovo's long contigs excelled in taxonomy assignation and functional annotation, respectively.
tBLASTx stood out as the best approach for taxonomic annotation for virus identification. PhymmBL proved useful in datasets in which no related sequences are present as it uses genomic features that may help identify distant taxa. The k-frequencies underperformed in all viral datasets.
Viral metagenome; Assembler performance; Taxonomic classification; Chimera identification; Functional annotation
Usually, next generation sequencing (NGS) technology has the property of ultra-high throughput but the read length is remarkably short compared to conventional Sanger sequencing. Paired-end NGS could computationally extend the read length but with a lot of practical inconvenience because of the inherent gaps. Now that Illumina paired-end sequencing has the ability of read both ends from 600 bp or even 800 bp DNA fragments, how to fill in the gaps between paired ends to produce accurate long reads is intriguing but challenging.
We have developed a new technology, referred to as pseudo-Sanger (PS) sequencing. It tries to fill in the gaps between paired ends and could generate near error-free sequences equivalent to the conventional Sanger reads in length but with the high throughput of the Next Generation Sequencing. The major novelty of PS method lies on that the gap filling is based on local assembly of paired-end reads which have overlaps with at either end. Thus, we are able to fill in the gaps in repetitive genomic region correctly. The PS sequencing starts with short reads from NGS platforms, using a series of paired-end libraries of stepwise decreasing insert sizes. A computational method is introduced to transform these special paired-end reads into long and near error-free PS sequences, which correspond in length to those with the largest insert sizes. The PS construction has 3 advantages over untransformed reads: gap filling, error correction and heterozygote tolerance. Among the many applications of the PS construction is de novo genome assembly, which we tested in this study. Assembly of PS reads from a non-isogenic strain of Drosophila melanogaster yields an N50 contig of 190 kb, a 5 fold improvement over the existing de novo assembly methods and a 3 fold advantage over the assembly of long reads from 454 sequencing.
Our method generated near error-free long reads from NGS paired-end sequencing. We demonstrated that de novo assembly could benefit a lot from these Sanger-like reads. Besides, the characteristic of the long reads could be applied to such applications as structural variations detection and metagenomics.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-14-711) contains supplementary material, which is available to authorized users.
Next-generation sequencing; Gap filling; Genome assembly
Next-Generation-Sequencing is advantageous because of its much higher data throughput and much lower cost compared with the traditional Sanger method. However, NGS reads are shorter than Sanger reads, making de novo genome assembly very challenging. Because genome assembly is essential for all downstream biological studies, great efforts have been made to enhance the completeness of genome assembly, which requires the presence of long reads or long distance information. To improve de novo genome assembly, we develop a computational program, ARF-PE, to increase the length of Illumina reads. ARF-PE takes as input Illumina paired-end (PE) reads and recovers the original DNA fragments from which two ends the paired reads are obtained. On the PE data of four bacteria, ARF-PE recovered >87% of the DNA fragments and achieved >98% of perfect DNA fragment recovery. Using Velvet, SOAPdenovo, Newbler, and CABOG, we evaluated the benefits of recovered DNA fragments to genome assembly. For all four bacteria, the recovered DNA fragments increased the assembly contiguity. For example, the N50 lengths of the P. brasiliensis contigs assembled by SOAPdenovo and Newbler increased from 80,524 bp to 166,573 bp and from 80,655 bp to 193,388 bp, respectively. ARF-PE also increased assembly accuracy in many cases. On the PE data of two fungi and a human chromosome, ARF-PE doubled and tripled the N50 length. However, the assembly accuracies dropped, but still remained >91%. In general, ARF-PE can increase both assembly contiguity and accuracy for bacterial genomes. For complex eukaryotic genomes, ARF-PE is promising because it raises assembly contiguity. But future error correction is needed for ARF-PE to also increase the assembly accuracy. ARF-PE is freely available at http://126.96.36.199/~tliu/arf-pe/.
Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on the approach of aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately as reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads to make longer contigs before annotation. More reads/contigs can be annotated as a longer contig (in Kbp) can be aligned to a taxon even if it is from an unknown species as long as it contains a conserved region of that taxon. Unfortunately existing metagenomic assembly tools are not mature enough to produce long enough contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon and these reads altogether should cover a significant portion of the genome alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction.
In this paper, we describe MetaCluster-TA, an assembly-assisted binning-based annotation tool which relies on an innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads and then represent each cluster as a set of 'virtual contigs' (which together can be total up to 1 Mb in length) for annotation. MetaCluster-TA can outperform widely-used MEGAN4 and can annotate (1) more reads since the virtual contigs are much longer; (2) more accurately since each cluster of long virtual contigs contains global information of the sampled genome which tends to be more accurate than short reads or assembled contigs which contain only local information of the genome; and (3) more efficiently since there are much fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool since binning itself can be more sensitive and precise given long virtual contigs and the binning results can be improved using the reference taxonomic database.
MetaCluster-TA can outperform widely-used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing has encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms, however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage which are known to impact genome assembly are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E.coli (4.6 MB) S.kudriavzevii (11.18 MB) and C.elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.
New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics than produced by Sanger sequencing, and require different approaches than the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology.
This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short-paired reads using a trivially small number (5 – 10) of seeds of length 300 – 500 bp. These seeds enable us to produce significant assemblies using short-read coverage no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads.
We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data, however our results on this data are substantially better than those achieved by competing assemblers.
Summary: Bacterial genomes are simpler than mammalian ones, and yet assembling the former from the data currently generated by high-throughput short-read sequencing machines still results in hundreds of contigs. To improve assembly quality, recent studies have utilized longer Pacific Biosciences (PacBio) reads or jumping libraries to connect contigs into larger scaffolds or help assemblers resolve ambiguities in repetitive regions of the genome. However, their popularity in contemporary genomic research is still limited by high cost and error rates. In this work, we explore the possibility of improving assemblies by using complete genomes from closely related species/strains. We present Ragout, a genome rearrangement approach, to address this problem. In contrast with most reference-guided algorithms, where only one reference genome is used, Ragout uses multiple references along with the evolutionary relationship among these references in order to determine the correct order of the contigs. Additionally, Ragout uses the assembly graph and multi-scale synteny blocks to reduce assembly gaps caused by small contigs from the input assembly. In simulations as well as real datasets, we believe that for common bacterial species, where many complete genome sequences from related strains have been available, the current high-throughput short-read sequencing paradigm is sufficient to obtain a single high-quality scaffold for each chromosome.
Availability: The Ragout software is freely available at: https://github.com/fenderglass/Ragout.
Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics.
Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behaviour of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remains unidentified when using real data.
The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity.
The de novo assembly of transcriptomes from short shotgun sequences
raises challenges due to random and non-random sequencing biases and
inherent transcript complexity. We sought to define a pipeline for de
novo transcriptome assembly to aid researchers working with
emerging model systems where well annotated genome assemblies are not
available as a reference. To detail this experimental and computational
method, we used early embryos of the sea anemone, Nematostella
vectensis, an emerging model system for studies of animal body plan
evolution. We performed RNA-seq on embryos up to 24 h of development
using Illumina HiSeq technology and evaluated independent de novo
assembly methods. The resulting reads were assembled using either the
Trinity assembler on all quality controlled reads or both the Velvet and
Oases assemblers on reads passing a stringent digital normalization filter.
A control set of mRNA standards from the National Institute of Standards and
Technology (NIST) was included in our experimental pipeline to invest our
transcriptome with quantitative information on absolute transcript levels
and to provide additional quality control.
We generated >200 million paired-end reads from directional cDNA libraries
representing well over 20 Gb of sequence. The Trinity assembler pipeline,
including preliminary quality control steps, resulted in more than 86% of
reads aligning with the reference transcriptome thus generated.
Nevertheless, digital normalization combined with assembly by Velvet and
Oases required far less computing power and decreased processing time while
still mapping 82% of reads. We have made the raw sequencing reads and
assembled transcriptome publically available.
Nematostella vectensis was chosen for its strategic position in the
tree of life for studies into the origins of the animal body plan, however,
the challenge of reference-free transcriptome assembly is relevant to all
systems for which well annotated gene models and independently verified
genome assembly may not be available. To navigate this new territory, we
have constructed a pipeline for library preparation and computational
analysis for de novo transcriptome assembly. The gene models
defined by this reference transcriptome define the set of genes transcribed
in early Nematostella development and will provide a valuable
dataset for further gene regulatory network investigations.
Transcriptome; Gene regulatory network; Nematostella embryonic development; Body plan evolution; Next-generation sequencing; Illumina HiSeq; Trinity; Oases; RNA-seq
Brassica rapa (AA) contains very diverse forms which include oleiferous types and many vegetable types. Genome sequence of B. rapa line Chiifu (ssp. pekinensis), a leafy vegetable type, was published in 2011. Using this knowledge, it is important to develop genomic resources for the oleiferous types of B. rapa. This will allow more involved molecular mapping, in-depth study of molecular mechanisms underlying important agronomic traits and introgression of traits from B. rapa to major oilseed crops - B. juncea (AABB) and B. napus (AACC). The study explores the availability of SNPs in RNA-seq generated contigs of three oleiferous lines of B. rapa - Candle (ssp. oleifera, turnip rape), YSPB-24 and Tetra (ssp. trilocularis, Yellow sarson) and their use in genome-wide linkage mapping and specific-region fine mapping using a RIL population between Chiifu and Tetra.
RNA-seq was carried out on the RNA isolated from young inflorescences containing unopened floral buds, floral axis and small leaves, using Illumina paired-end sequencing technology. Sequence assembly was carried out using the Velvet de-novo programme and the assembled contigs were organised against Chiifu gene models, available in the BRAD-CDS database. RNA-seq confirmed the presence of more than 17,000 single-copy gene models described in the BRAD database. The assembled contigs and the BRAD gene models were analyzed for the presence of SSRs and SNPs. While the number of SSRs was limited, more than 0.2 million SNPs were observed between Chiifu and the three oleiferous lines. Assays for SNPs were designed using KASPar technology and tested on a F7-RIL population derived from a Chiifu x Tetra cross. The design of the SNP assays were based on three considerations - the 50 bp flanking region of the SNPs should be strictly similar, the SNP should have a read-depth of ≥7 and no exon/intron junction should be present within the 101 bp target region. Using these criteria, a total of 640 markers (580 for genome-wide mapping and 60 for specific-region mapping) marking as many genes were tested for mapping. Out of 640 markers that were tested, 594 markers could be mapped unambiguously which included 542 markers for genome-wide mapping and 42 markers for fine mapping of the tet-o locus that is involved with the trait tetralocular ovary in the line Tetra.
A large number of SNPs and PSVs are present in the transcriptome of B. rapa lines for genome-wide linkage mapping and specific-region fine mapping. Criteria used for SNP identification delivered markers, more than 93% of which could be successfully mapped to the F7–RIL population of Chiifu x Tetra cross.
Brassica rapa; RNA-seq; Next generation sequencing; Single nucleotide polymorphism (SNP); Paralog specific variation (PSV); Coding DNA Sequences (CDS); KASPar assays
Motivation: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a ‘hybrid’ approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.
Results: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data.
Availability: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions.
Results: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of individual tools are summarized and tentative guidelines for optimal selection of different assembly tools, under different conditions, are provided.
Supplementary information: Supplementary data are available at Bioinformatics online.
The feasibility of short-read sequencing for genomic analysis was demonstrated for Fibroporia radiculosa, a copper-tolerant fungus that causes brown rot decay of wood. The effect of read quality on genomic assembly was assessed by filtering Illumina GAIIx reads from a single run of a paired-end library (75-nucleotide read length and 300-bp fragment size) at three different stringency levels and then assembling each data set with Velvet. A simple approach was devised to determine which filter stringency was “best.” Venn diagrams identified the regions containing reads that were used in an assembly but were of a low-enough quality to be removed by a filter. By plotting base quality histograms of reads in this region, we judged whether a filter was too stringent or not stringent enough. Our best assembly had a genome size of 33.6 Mb, an N50 of 65.8 kb for a k-mer of 51, and a maximum contig length of 347 kb. Using GeneMark, 9,262 genes were predicted. TargetP and SignalP analyses showed that among the 1,213 genes with secreted products, 986 had motifs for signal peptides and 227 had motifs for signal anchors. Blast2GO analysis provided functional annotation for 5,407 genes. We identified 29 genes with putative roles in copper tolerance and 73 genes for lignocellulose degradation. A search for homologs of these 102 genes showed that F. radiculosa exhibited more similarity to Postia placenta than Serpula lacrymans. Notable differences were found, however, and their involvements in copper tolerance and wood decay are discussed.
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
In this paper we demonstrate that a bacterial genome, Pseudomonas aeruginosa, can be decoded using very short DNA sequences, namely, those produced by the newest generation of DNA sequencers such as the Solexa sequencer from Illumina. Our method includes a novel algorithm that uses the protein sequences from other species to assist the assembly of the new genome. This algorithm breaks up the genome into gene-sized chunks that can be put back together relatively easily, even from sequence fragments as short as 30 bases of DNA. We also take advantage of the genomes of related species, using them as reference strains to assist the assembly. By combining these and other techniques, we were able to assemble 94% of the 6.7 million bases of P. aeruginosa into just 76 large pieces. The remaining 6% is contained in 436 smaller fragments. We have made all of our software available for free under open-source licenses, and we have deposited the newly assembled genome in the public GenBank database.
Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies.
We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly.
These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
The recent proliferation of next generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph based assembler, maintaining all other assembly steps such as error-correction and repeat resolution. Through assembly results on simulated perfect data, we argue that this can effectively improve the contig sizes in assembly.
de Bruijn graphs; fragment assembly; mate pairs; paired de Bruijn graphs
The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, the SOLiD read data with lengths of <60 bp have been considered to be too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only SOLiD read data of 50 bp generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, a 22-fold increase than those obtained using only SOLiD short read in other published reports. In addition, almost 99% of the reference genome was accurately aligned by the assembled scaffold fragments in long lengths. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also highly reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effect of filtering low quality data, library insert size, and k-mer size on the assembly performance, and recommend for the assembly use of mild filtered read data where the N50 was not so degraded and the library has an insert size of ∼2.0 kb, and k-mer size 33.
Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.
Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from an Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously.
This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.
Third-generation sequencing; NGen; Genomics; Assembly; Annotation; Oxford nanopore; Pacific BioSciences; Roche 454
Pleurocybellaporrigens is a mushroom-forming fungus, which has been consumed as a traditional food in Japan. In 2004, 55 people were poisoned by eating the mushroom and 17 people among them died of acute encephalopathy. Since then, the Japanese government has been alerting Japanese people to take precautions against eating the P. porrigens mushroom. Unfortunately, despite efforts, the molecular mechanism of the encephalopathy remains elusive. The genome and transcriptome sequence data of P. porrigens and the related species, however, are not stored in the public database. To gain the omics data in P. porrigens, we sequenced genome and transcriptome of its fruiting bodies and mycelia by next generation sequencing.
Short read sequences of genomic DNAs and mRNAs in P. porrigens were generated by Illumina Genome Analyzer. Genome short reads were de novo assembled into scaffolds using Velvet. Comparisons of genome signatures among Agaricales showed that P. porrigens has a unique genome signature. Transcriptome sequences were assembled into contigs (unigenes). Biological functions of unigenes were predicted by Gene Ontology and KEGG pathway analyses. The majority of unigenes would be novel genes without significant counterparts in the public omics databases.
Functional analyses of unigenes present the existence of numerous novel genes in the basidiomycetes division. The results mean that the omics information such as genome, transcriptome and metabolome in basidiomycetes is short in the current databases. The large-scale omics information on P. porrigens, provided from this research, will give a new data resource for gene discovery in basidiomycetes.
Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
Availability and Implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash.