Despite a dramatic increase in the number of complete genome sequences available in public databases, the vast majority of the biological diversity in our world remains unexplored. SRS technologies have the potential to significantly accelerate the sequencing of new organisms.
De novo assembly of SRS data, however, will require the development of new software tools that can overcome the technical limitations of these technologies. An overview of genome assembly is provided in Box 1.
Box 1. De novo genome assembly
De novo genome assembly is often likened to solving a large jigsaw puzzle without knowing the picture we are trying to reconstruct. Repetitive DNA segments correspond to similarly colored pieces in this puzzle (e.g. sky) that further complicate the reconstruction.
Mathematically, the de novo assembly problem is difficult irrespective of the sequencing technology used, falling in the class of NP-hard problems [45], computational problems for which no efficient solution is known. Repeats are the primary source of this complexity, specifically repetitive segments longer than the length of a read. An assembler must either ‘guess’ (often incorrectly) the correct genome from among a large number of alternatives (a number that grows exponentially with the number of repeats in the genome) or restrict itself to assembling only the nonrepetitive segments of the genome, thereby producing a fragmented assembly.
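The effect of repeats longer than a read can be illustrated with a small sketch. The sequences below are invented for illustration and are not taken from any cited study: two toy genomes differ only in the order of the unique segments between three copies of a 10 bp repeat. Reads that cannot span the repeat plus one base on each side yield exactly the same collection of reads from both genomes, so no assembler could tell the two apart.

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Multiset of every error-free read of length read_len drawn from genome."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))

# Hypothetical toy sequences: one 10 bp repeat and four unique segments.
REPEAT = "GATTACAGGC"
a, b, c, d = "CCTTA", "GTGTG", "CACAC", "TGGAA"

# The two genomes differ only in the order of segments b and c.
genome_1 = a + REPEAT + b + REPEAT + c + REPEAT + d
genome_2 = a + REPEAT + c + REPEAT + b + REPEAT + d

# 11 bp reads cannot bridge the repeat with a base on both sides.
same_at_11 = read_spectrum(genome_1, 11) == read_spectrum(genome_2, 11)
# 12 bp reads can, so the read sets finally differ.
same_at_12 = read_spectrum(genome_1, 12) == read_spectrum(genome_2, 12)

print(same_at_11, same_at_12)  # prints: True False
```

With more repeat copies, the number of segment orderings that are consistent with the same reads grows rapidly, which is the source of the exponential number of alternatives mentioned above.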
The complexity of the assembly problem has partly been overcome in Sanger projects because of the long reads produced by this technology, as well as through the use of mate-pairs (pairs of reads whose approximate distance within the genome is known). Paired reads are particularly useful as they allow the assembler to correctly resolve repeats and to provide an ordering of the contigs along the genome.
Studies by Chaisson et al. [18] and Whiteford et al. [16] showed a rapid deterioration in assembly quality as the read length decreases. Chaisson et al. [18] showed that, for reads of 750 bp (e.g. Sanger sequencing), an assembly of Neisseria meningitidis resulted in 59 contigs, 48 of which were >1 kbp, whereas at 70 bp, the assembly consisted of >1800 contigs, of which only a sixth were >1 kbp. Even for relatively long reads (200 bp), the resulting assembly was substantially fragmented (296 contigs). Similar results were obtained by Whiteford et al. [16], who observed a rapid decrease in contig sizes for reads shorter than ~50 bp.
To overcome some of the challenges posed by repeats, Sundquist et al. [19] proposed a hierarchical sequencing strategy called SHRAP (SHort Read Assembly Protocol) wherein a genome is first sheared into a collection of large fragments [e.g. bacterial artificial chromosome (BAC) clones], each of which is sequenced by SRS. The reads are used to infer a tiling of the BAC clones along the genome, and an assembly is constructed by pooling together reads originating from localized regions within the tiling. The individual assemblies are combined based on the BAC tiling. Tests using simulated data show the SHRAP strategy to be effective in assembling large genomes (several human chromosomes and Drosophila melanogaster); however, read lengths of 200 bp or longer are necessary for good quality assemblies.
In general, assembly tools originally developed for Sanger sequencing data cannot be directly applied to SRS technologies, partly because of specific algorithmic choices that rely on long read lengths and partly because of the specific error characteristics of SRS data (e.g. pyrosequencing technologies are characterized by high error rates in homopolymer regions). Many of these tools would also encounter performance limitations because of the vastly larger number of reads generated by SRS projects; for example, 8 times coverage of a mammalian genome (3 Gbp in length) requires 30 million Sanger reads but 750 million Illumina reads.
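The read counts quoted above follow from the usual relation coverage = (number of reads × read length) / genome size. The short calculation below is a sketch of that arithmetic. The function name is illustrative, and the read lengths it prints are implied by the quoted figures rather than stated in the text.

```python
def implied_read_length(genome_size_bp, coverage, num_reads):
    """Average read length (bp) needed to reach the given coverage."""
    return genome_size_bp * coverage / num_reads

GENOME_SIZE = 3_000_000_000  # 3 Gbp mammalian genome, as in the text
COVERAGE = 8                 # 8 times coverage, as in the text

sanger_len = implied_read_length(GENOME_SIZE, COVERAGE, 30_000_000)
illumina_len = implied_read_length(GENOME_SIZE, COVERAGE, 750_000_000)

print(sanger_len, illumina_len)  # prints: 800.0 32.0
```

Under these assumptions the same coverage requires 25 times more reads from the short-read platform, which is the scaling pressure that assemblers designed for Sanger data would face.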
Several assembly programs have been developed for de novo assembly of SRS data. Newbler (roche-applied-science.com) is distributed with 454 Life Sciences instruments and has been successfully used in the assembly of bacteria [20]. With sufficiently deep coverage, typically 25–30 times, the resulting assemblies are comparable to those obtained through Sanger sequencing [21]. Note, however, that these results do not account for the additional information provided by mate-pairs, information commonly available in Sanger data but only recently introduced to the 454 technology.
Three recently developed assembly tools tackle de novo assembly using very short sequences (30–40 bp). SSAKE [22], VCAKE [23] and SHARCGS [24] all use a similar ‘greedy’ approach to genome assembly. Specifically, reads are chosen to form ‘seeds’ for contig formation. Each seed is extended by identifying reads that overlap it sufficiently (more than a specific length and quality cut-off) in either the 5′ or 3′ direction. The extension process iteratively grows the contig as long as the extensions are unambiguous (i.e. there are no sequence differences between the multiple reads that overlap the end of the growing contig). This procedure avoids misassemblies caused by repeats but produces very small contigs. The assembly of bacterial genomes using Illumina data created contigs that are only a few kilobases in length [23, 24], in contrast to the hundreds of kilobases commonly achieved in Sanger-based assemblies. This fragmentation is caused in part by inherent difficulties in assembling short read data, although more sophisticated assembly algorithms should overcome some of these limitations in the future (as was the case when Sanger sequencing was first introduced). These programs have relatively long running times, on the order of 6–10 h for bacterial assemblies [23, 24], at least partly because of the large number of reads generated in an SRS project. By contrast, assemblers for Sanger data can assemble bacterial genomes in just a few minutes.
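The greedy seed-and-extend idea can be sketched in a few lines. This is a simplified illustration and not the actual implementation of SSAKE, VCAKE or SHARCGS: it extends only in the 3′ direction, ignores quality values, and adds one base at a time. The function name, the `max_length` guard and the toy sequences are assumptions made for the example.

```python
def extend_right(contig, reads, min_overlap, max_length=10_000):
    """Grow contig in the 3' direction while the next base is unambiguous."""
    while len(contig) < max_length:  # guard against endless cyclic extension
        next_bases = set()
        for read in reads:
            # Longest overlap first; the read must add at least one new base.
            for k in range(len(read) - 1, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    next_bases.add(read[k])
                    break
        if len(next_bases) != 1:
            break  # no overlapping read, or overlapping reads disagree
        contig += next_bases.pop()
    return contig

# Toy genome (hypothetical) sampled as every 6 bp read.
genome = "ATGCCGTAAGCTTGAC"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
full = extend_right(reads[0], reads, min_overlap=4)

# Two reads that overlap the contig end but disagree on the next base.
stalled = extend_right("CCACGTA", ["ACGTAG", "ACGTAT"], min_overlap=4)

print(full == genome, stalled)  # prints: True CCACGTA
```

The second call shows why this strategy yields short contigs: extension halts at the first disagreement, which is exactly what happens at the boundary of a repeat.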
Another strategy for de novo genome sequencing uses a hybrid of SRS and Sanger sequencing to reduce costs and fill in coverage gaps caused by cloning biases. Such an approach was followed by Goldberg et al. [25], who used Newbler for an initial assembly of data obtained from a 454 sequencer. They broke the Newbler contigs into overlapping Sanger-sized fragments and used Celera Assembler [26] to combine these fragments with sequence reads obtained from Sanger sequencers. This strategy proved successful in the assembly of several marine bacteria. The addition of 454 data produced better assemblies than those obtained with Sanger data alone, and for two of the genomes, the hybrid assembly enabled the reconstruction of an entire chromosome without gaps.
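The step of breaking contigs into overlapping Sanger-sized fragments might look like the sketch below. The fragment length and overlap used by Goldberg et al. are not given here, so the values of 800 bp and 400 bp, as well as the function name, are purely illustrative assumptions.

```python
def shred(contig, fragment_len=800, overlap=400):
    """Cut a contig into overlapping pseudo-reads of equal length."""
    step = fragment_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than fragment_len")
    if len(contig) <= fragment_len:
        return [contig]
    last_start = len(contig) - fragment_len
    fragments = [contig[i:i + fragment_len]
                 for i in range(0, last_start + 1, step)]
    # Add a final fragment if the regular steps left the contig end uncovered.
    if last_start % step != 0:
        fragments.append(contig[-fragment_len:])
    return fragments

# Hypothetical 2100 bp contig.
contig = "ACGT" * 525
pseudo_reads = shred(contig)
print(len(pseudo_reads))  # prints: 5
```

The resulting pseudo-reads can then be passed to a Sanger-era assembler alongside real Sanger reads, which is the role Celera Assembler plays in the strategy described above.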
The assemblers named above follow the standard overlap-layout-consensus approach to genome assembly, a paradigm that treats each read as a discrete unit during the reconstruction of a genome. An alternative recently proposed by Chaisson and Pevzner [27] uses a de Bruijn graph approach, an extension of the authors’ prior work on assembly of Sanger data. Briefly, a de Bruijn graph assembler starts by decomposing the set of reads into a set of shorter DNA segments. A graph is constructed that contains the segments as nodes and in which two segments are connected if they are adjacent in one of the original reads. A correct reconstruction of the genome is represented as a path through this graph that traverses all the edges (an Eulerian path). By fragmenting the original reads into smaller segments, this paradigm is less affected by the short read lengths generated by SRS technologies, and it also provides a simple mechanism for combining reads of varied lengths. Chaisson and Pevzner [27] showed their assembler (Euler-SR) is able to generate high-quality assemblies of bacterial genomes from 454 reads and of BAC clones from Solexa reads. They also explored the use of a hybrid assembly approach (Sanger + 454) and interestingly showed that only a small percentage of the longer reads provided information not already represented in the short reads, thus suggesting the need for a careful evaluation of the benefits of hybrid sequencing approaches.
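The de Bruijn construction described above can be sketched as follows. This is a minimal illustration, not the Euler-SR implementation: it assumes error-free reads, keeps each adjacency only once (so repeat multiplicity is lost), and finds the Eulerian path with Hierholzer's algorithm, a standard method that the text does not specify. All function names and the toy genome are assumptions.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers adjacent in some read."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(read[i:i + k + 1])  # a (k+1)-mer encodes one adjacency
    graph = defaultdict(list)
    for edge in sorted(edges):
        graph[edge[:k]].append(edge[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_deg[node] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in sorted(nodes)
                  if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
                 min(nodes))
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Read the sequence off a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

# Toy genome (hypothetical) in which the 3-mer ATG occurs twice.
genome = "ATGCCGTATGAC"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
assembled = spell(eulerian_path(build_graph(reads, k=3)))
print(assembled == genome)  # prints: True
```

Because every read is decomposed into the same fixed-size segments, reads of any length of at least k + 1 bases can be added to `reads` without changing the procedure, which reflects the point above about combining reads of varied lengths.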