|Home | About | Journals | Submit | Contact Us | Français|
The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
The advent of short-read sequencing machines gave rise to a new generation of assembly algorithms and software. This survey reviews algorithms for de novo whole-genome shotgun assembly from next-generation sequencing data. It describes and compares algorithms that have been presented in the scientific literature and implemented in software. We use a narrow definition of de novo whole-genome shotgun assembly. The shotgun process takes reads from random positions along a target molecule . Whole-genome shotgun (WGS) sequencing samples the chromosomes that make up one genome. WGS assembly is the reconstruction of sequence up to chromosome length. The assembly task is relegated to computer software . Assembly is possible when the target is over-sampled by the shotgun reads, such that reads overlap. De novo WGS assembly refers to reconstruction in its pure form, without consultation to previously resolved sequence including from genomes, transcripts, and proteins. De novo WGS assembly of next-generation sequencing (NGS) data is a specialized problem due to the short read lengths and large volumes of NGS data. Benchmarking the implementations is beyond the scope of this review. Broader introductions can be found elsewhere, e.g. .
Today’s commercial DNA sequencing platforms include the Genome Sequencer from Roche 454 Life Sciences (www.454.com), the Solexa Genome Analyzer from Illumina (www.illumina.com), the SOLiD System from Applied Biosystems (www.appliedbiosystems.com), the Heliscope from Helicos (www.helicos.com), and the commercialized Polonator (www.polonator.org). These platforms have been well reviewed, e.g. ; ; ; . A distinguishing characteristic of these platforms is that they do not rely on Sanger chemistry  as did first-generation machines including the Applied Biosystems Prism 3730 and the Molecular Dynamics MegaBACE. The second-generation machines are characterized by highly parallel operation, higher yield, simpler operation, much lower cost per read, and (unfortunately) shorter reads. Today’s machines are commonly referred to as short-read sequencers or next-generation sequencers (NGS) though their successors may be on the horizon. Pacific Biosciences machines  might produce reads longer than first-generation machines. First-generation reads were commonly 500bp to 1000bp long. Today’s NGS reads are in the 400bp range (from 454 machines), the 100bp range (from the Solexa and SOLID machines), or less. Shorter reads deliver less information per read, confounding the computational problem of assembling chromosome-size sequences. Assembly of shorter reads requires higher coverage, in part to satisfy minimum detectable overlap criteria. High coverage increases complexity and intensifies computational issues related to large data sets.
All sequencers produce observations of the target DNA molecule in the form of reads: sequences of single-letter base calls plus a numeric quality value (QV) for each base call . Although QVs offer extra information, their use generally increases a program’s CPU and RAM requirements. Only some of the NGS assembly software exploits QVs.
The NGS platforms have characteristic error profiles that change as the technologies improve. Error profiles can include enrichment of base call error toward the 3′ (terminal) ends of reads, compositional bias for or against high-GC sequence, and inaccurate determination of simple sequence repeats. There are published error profiles for the 454 GS 20 , the Illumina 1G Analyzer , and comparisons of three platforms . Some NGS software is tuned for platform-specific error profiles. Others may have unintentional bias where development targeted one data type.
Sanger platforms could deliver paired-end reads, that is, pairs of reads with a constraint on their relative orientation and separation in the target. Paired ends were essential to assembly of cellular genomes small  and large  due to their ability to span repeats longer than individual reads. Paired ends, also called mate pairs, have a separation estimate that is usually provided to software as the fragment size distribution measured on a so-called library of reads. A sufficient variety of paired end separations should help resolve large chromosomes to single scaffolds . Early NGS sequencers offered unpaired reads but late models support paired-end protocols. Early NGS assembly software targeted unpaired reads but later programs exploit paired ends as read placement constraints.
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target. It groups reads into contigs and contigs into scaffolds. Contigs provide a multiple sequence alignment of reads plus the consensus sequence. The scaffolds, sometimes called supercontigs or metacontigs, define the contig order and orientation and the sizes of the gaps between contigs. Scaffold topology may be a simple path or a network. Most assemblers output, in addition, a set of unassembled or partially assembled reads. The most widely accepted data file format for an assembly is FASTA, wherein contig consensus sequence can be represented by strings of the characters A, C, G, T, plus possibly other characters with special meaning. Dashes, for instance, can represent extra bases omitted from the consensus but present in a minority of the underlying reads. Scaffold consensus sequence may have N’s in the gaps between contigs. The number of consecutive N’s may indicate the gap length estimate based on spanning paired ends.
Assemblies are measured by the size and accuracy of their contigs and scaffolds. Assembly size is usually given by statistics including maximum length, average length, combined total length, and N50. The contig N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. Assembly accuracy is difficult to measure. Some inherent measure of accuracy is provided by the degrees of mate-constraint satisfaction and violation . Alignment to reference sequences is useful whenever trusted references exist.
DNA sequencing technologies share the fundamental limitation that read lengths are much shorter than even the smallest genomes. WGS overcomes this limitation by over-sampling the target genome with short reads from random positions. Assembly software reconstructs the target sequence.
Assembly software is challenged by repeat sequences in the target. Genomic regions that share perfect repeats can be indistinguishable, especially if the repeats are longer than the reads. For repeats that are inexact, high-stringency alignment can separate the repeat copies. Careful repeat separation involves correlating reads by patterns in the different base calls they may have . Repeat separation is assisted by high coverage but confounded by high sequencing error. For repeats whose fidelity exceeds that of the reads, repeat resolution depends on “spanners,” that is, single reads that span a repeat instance with sufficient unique sequence on either side of the repeat. Repeats longer than the reads can be resolved by spanning paired ends, but the analysis is more complicated. Complete resolution usually requires two resources: pairs that straddle the repeat with each end in unique sequence, and pairs with exactly one end in the repeat. The limit of repeat resolution can be explored for finished genomes under some strict assumptions. For instance, it was shown that the theoretical best assembly of the E. coli genome from 20 bp unpaired reads would put 10% of bases in contigs of 10 Kbp or longer given infinite coverage and error-free reads . The limit calculation is not straightforward for reads with sequencing error, paired-end reads, or unfinished genomes. Careful estimates of repeat resolution involve the ratio of read length (or paired-end separation) to repeat length, repeat fidelity, read accuracy, and read coverage. In regard to NGS data, shorter reads have less power to resolve genomic repeats but higher coverage increases the chance of spanning short repeats.
Repeat resolution is made more difficult by sequencing error. Software must tolerate imperfect sequence alignments to avoid missing true joins. Error tolerance leads to false positive joins. This is a problem especially with reads from inexact (polymorphic) repeats. False-positive joins can induce chimeric assemblies. In practice, tolerance for sequencing error makes it difficult to resolve a wide range of genomic phenomena: polymorphic repeats, polymorphic differences between non-clonal asexual individuals, polymorphic differences between non-inbred sexual individuals, and polymorphic haplotypes from one non-inbred individual. If the sequencing platforms ever generate error-free reads at high coverage, assembly software might be able to operate at 100% stringency.
WGS assembly is confounded by non-uniform coverage of the target. Coverage variation is introduced by chance, by variation in cellular copy number between source DNA molecules, and by compositional bias of sequencing technologies. Very low coverage induces gaps in assemblies. Coverage variability invalidates coverage-based statistical tests, and undermines coverage-based diagnostics designed to detect over-collapsed repeats.
WGS assembly is complicated by the computational complexity of processing larger volumes of data. For efficiency, all assembly software relies to some extent on the notion of a K-mer. This is a sequence of K base calls, where K is any positive integer. In most implementations, only consecutive bases are used. Intuitively, reads with high sequence similarity must share K-mers in their overlapping regions, and shared K-mers are generally easier to find than overlaps. Fast detection of shared K-mer content vastly reduces the computational cost of assembly, especially compared to all-against-all pairwise sequence alignment. A tradeoff of K-mer based algorithms is lower sensitivity, thus missing some true overlaps. The probability that a true overlap spans shared K-mers depends on the value of K, the length of the overlap, and the rate of error in the reads. An appropriate value of K should be large enough that most false overlaps don’t share K-mers by chance, and small enough that most true overlaps do share K-mers. The choice should be robust to variation in read coverage and accuracy.
WGS assembly algorithms, and their implementations, are typically complex. Assembly operation can require high-performance computing platforms for large genomes. Algorithmic success can depend on pragmatic engineering and heuristics, that is, empirically derived rules of thumb. Heuristics help overcome convoluted repeat patterns in real genomes, random and systematic error in real data, and the physical limitations of real computers.
Metagenomics is the sequencing of DNA in an environmental sample. Whereas WGS targets one genome, metagenomics usually targets several. The metagenomics assembly problem is confounded by genomic diversity and variable abundance within populations. Assembly reconstructs the most abundant sequences . Simulations indicate high rates of chimera, especially in short contigs assembled from complex mixtures . Studies that rely on characterization of individual reads prefer long reads . The role for de novo genomic assembly from NGS metagenomics data should grow as NGS read lengths and NGS paired-end options increase.
We organize the NGS assemblers into three categories, all based on graphs. The Overlap/Layout/Consensus (OLC) methods rely on an overlap graph. The de Bruijn Graph (DBG) methods use some form of K-mer graph. The greedy graph algorithms may use OLC or DBG.
A graph is an abstraction used widely in computer science. It is a set of nodes plus a set of edges between the nodes. Nodes and edges may also be called vertices and arcs, respectively. If the edges may only be traversed in one direction, the graph is known as a directed graph. The graph can be conceptualized as balls in space with arrows connecting them. Importantly, each directed edge represents a connection from one source node to one sink node. Collections of edges form paths that visit nodes in some order, such that the sink node of one edge forms the source node for any subsequent nodes. A special kind of path, called a simple path, is one that contains only distinct nodes (each node is visited at most once). A simple path may not intersect itself, by definition, and one may additionally require that no other paths intersect it. The nodes and edges may be assigned a variety of attributes and semantics.
An overlap graph represents the sequencing reads and their overlaps . The overlaps must be pre-computed by a series of (computationally expensive) pair-wise sequence alignments. Conceptually, the graph has nodes to represent the reads and edges to represent overlaps. In practice, the graph might have distinct elements or attributes to distinguish the 5′ and 3′ ends of reads, the forward and reverse complement sequences of reads, the lengths of reads, the lengths of overlaps, and the type of overlap (suffix-to-prefix or containment). Paths through the graph are the potential contigs, and paths can be converted to sequence. Paths may have mirror images representing the reverse complement sequence. There are two ways to force paths to obey the semantics of double-stranded DNA. If the graph has separate nodes for read ends, then paths must exit the opposite end of the read they enter. If the graph has separate edges for the forward and reverse strands, then paths must exit a node on the same strand they enter.
The de Bruijn graph was developed outside the realm of DNA sequencing to represent strings from a finite alphabet. The nodes represent all possible fixed-length strings. The edges represent suffix-to-prefix perfect overlaps.
A K-mer graph is a form of de Bruijn graph. Its nodes represent all the fixed-length subsequences drawn from a larger sequence. Its edges represent all the fixed-length overlaps between subsequences that were consecutive in the larger sequence. In one formulation , there is one edge for the K-mer that starts at each base (excluding the last K-1 bases). The nodes represent overlaps of K-1 bases. Alternately , there is one node representing the K-mer that starts at each base. The edges represent overlaps of K-1 bases. By construction, the graph contains a path corresponding to the original sequence (Figure 1). The path converges on itself at graph elements representing K-mers in the sequence whose multiplicity is greater than one.
A repeat graph is an application of the K-mer graph . It provides a succinct graph representation of the repetitiveness of a genome. Nodes and edges are drawn from an assembled reference sequence. Whereas non-repetitive genomic sequence would induce a single path through the graph, repeats induce convergence and divergence of paths, as well as cycles. Repeat graphs can be used to identify and catalog repeats .
A K-mer graph may represent many sequences instead of one. In its application to WGS assembly, the K-mer graph represents the input reads. Each read induces a path. Reads with perfect overlaps induce a common path. Thus, perfect overlaps are detected implicitly without any pair-wise sequence alignment calculation (Figure 2). Compared to overlap graphs, K-mer graphs are more sensitive to repeats and sequencing errors. Paths in overlap graphs converge at repeats longer than a read, but paths in K-mer graphs converge at perfect repeats of length K or more, and K must be less than the read length. Each single-base sequencing error induces up to K false nodes in the K-mer graph. Each false node has a chance of matching some other node and thereby inducing a false convergence of paths.
Real-world WGS data induces problems in overlap graphs and K-mer graphs.
In general, branching and convergence increases graph complexity, leading to tangles that are difficult to resolve. Much of the complexity is due to repeats in the target and sequencing error in the reads.
In the graph context, assembly is a graph reduction problem. Most optimal graph reductions belong to a class of problems, called NP-hard, for which no efficient solution is known . Therefore, assemblers rely on heuristic algorithms and approximation algorithms to remove redundancy, repair errors, reduce complexity, enlarge simple paths and otherwise simplify the graph.
The greedy algorithms apply one basic operation: given any read or contig, add one more read or contig. The basic operation is repeated until no more operations are possible. Each operation uses the next highest-scoring overlap to make the next join. The scoring function measures, for instance, the number of matching bases in the overlap. Thus the contigs grow by greedy extension, always taking on the read that is found by following the highest-scoring overlap. The greedy algorithms can get stuck at local maxima if the contig at hand takes on reads that would have helped other contigs grow even larger.
The greedy algorithms are implicit graph algorithms. They drastically simplify the graph by considering only the high-scoring edges. As an optimization, they may actually instantiate just one overlap for each read end they examine. They may also discard each overlap immediately after contig extension.
Like all assemblers, the greedy algorithms need mechanisms to avoid incorporating false-positive overlaps into contigs. Overlaps induced by repetitive sequence may score higher than overlaps induced by common position of origin. An assembler that builds on false-positive overlaps will join unrelated sequences to either side of a repeat to produce chimera.
SSAKE  was the first short-read assembler. It was designed for unpaired short reads of uniform length. It was based on the notion that high coverage would provide a tiling in error-free reads if the erroneous reads could be avoided. SSAKE does not use a graph explicitly. It does use a lookup table of reads indexed by their prefixes. SSAKE iteratively searches for reads that overlap one contig end. Its candidate reads must have a prefix-to-suffix identical overlap whose length is above a threshold. SSAKE chooses carefully among multiple reads with equally long overlaps. First, it prefers reads with end-to-end confirmation in other reads. This favors error-free reads. Second, the software detects when the set of candidates presents multiple extensions. In particular, it detects when the candidate read suffixes exhibit differences that are each confirmed in other reads. This is equivalent to finding a branch in a graph. At this point, the software terminates the contig extension. Users can elect to override the “stringent” behavior, in which case SSAKE takes the higher-scoring extension. When no reads satisfy the initial minimum threshold, the program decrements the threshold until a second minimum is reached. Thus, user settings determine how aggressively SSAKE extends through possible repeat boundaries and low-coverage regions. SSAKE has been extended to exploit paired-end reads and imperfectly matching reads .
SHARCGS  also operates on uniform-length, high-coverage, unpaired short reads. It adds pre- and post-processor functionality to the basic SSAKE algorithm. The preprocessor filters erroneous reads by requiring a minimum number of full-length exact matches in other reads. An even higher-stringency filter is optional, requiring that the combined QVs of matching reads exceed a minimum threshold. SHARCGS filters the raw read set three times, each at a different stringency setting, to generate three filtered sets. It assembles each set separately by iterative contig extension. Then, in a postprocess, it merges the three contig sets using sequence alignment. The merger aims to extend contigs from highly confirmed reads by integrating longer contigs from lower-stringency filters.
VCAKE  is another iterative extension algorithm. Unlike its predecessors, it could incorporate imperfect matches during contig extension. VCAKE was later combined with Newbler in a pipeline for Solexa+454 hybrid data . Another pipeline had combined Newbler and Celera Assembler for 454+Sanger hybrid data . Both pipelines “shred” contigs from the first assembler to produce pseudo-reads suitable for the second assembler. The latter pipeline adjusts the read coverage and base call quality values in the pseudo-reads it generates. This helps the secondary assembler give appropriate weight to high-coverage contigs from the primary assembly, for instance during consensus base calling.
The OLC approach was typical of the Sanger-data assemblers. It was optimized for large genomes in software including Celera Assembler , Arachne ;, and CAP and PCAP . The OLC approach has been reviewed elsewhere ; ; .
OLC assemblers use an overlap graph. Their operation has three phases:
Newbler  is widely used software distributed by 454 Life Sciences. The first release, described in the published supplement, targeted unpaired reads of approximately 100bp as generated by the GS 20 machine. Newbler has since been revised, in particular to build scaffolds from paired-end constraints. As described in 2005, Newbler implements OLC twice. The first-phase OLC generates unitigs from reads. Unitigs are mini-assemblies that are, ideally, uncontested by overlaps to reads in other unitigs . The unitigs serve as preliminary, high-confidence, conservative contigs that seed the rest of the assembly pipeline. The second-phase OLC generates larger contigs from the unitigs. This phase joins unitigs into a contig layout based on pair-wise overlaps between unitigs. It may split unitigs whose prefix and suffix align to different contigs. Unitig splitting may split individual reads, leading to reads placed in multiple contigs. Such reads may have been chimera or derived from a repeat boundary.
Newbler exploits coverage, if possible, to overcome base calling error. In particular, it uses instrument metrics to overcome inaccurate calls of the number of bases in homopolymer repeats. Newbler calculates unitig and contig consensus in “flow space” using the platform-supplied signal strength associated with each flow of a particular nucleotide. The normalized signal is proportionally correlated to the number of direct repeats of that nucleotide at that position in the read. Consensus calculation in “base space,” equivalent to rounding the signals before averaging, would sacrifice precision. For each column in the MSA, Newbler calculates the average signal and rounds to an integer to form the consensus.
The Newbler package offers functionality beyond de novo assembly, and it includes a comprehensive user guide. The Newbler software is distributed with the 454 sequencing machines. Customers receive frequent updates. Release descriptions indicate that recent versions differ from the published algorithm. The source code is not generally available.
The Celera Assembler  is a Sanger-era OLC assembler revised for 454 data . The revised pipeline, called CABOG, discovers overlaps using compressed seeds. CABOG reduces homopolymer runs, that is, repeats of single letters, to single bases to overcome homopolymer run length uncertainty in data. CABOG builds initial unitigs excluding reads that are substrings of other reads. Substrings account for a large portion of the data in high-coverage 454 data due to highly variable read lengths. CABOG avoids the substring-reads initially because they are more susceptible to repeat-induced false overlaps.
CABOG applies a base call correction scheme first described for Arachne . It compares each read to its set of overlapping reads. It infers sequencing error at any base contradicted by a preponderance of overlaps. It does not fix the read. Rather, it adjusts the tabulated error rates in overlaps spanning the inferred error. Next, it applies a user-supplied threshold for error rates. From the overlaps that survive the error filter and a filter for minimum alignment length, CABOG selects one “best” overlap per read end. Best is defined as aligning the most bases. The best-overlap filter is presumed to eliminate many of the same overlaps removed by the more time-costly transitive edge removal algorithm , which was employed by the original Celera Assembler. CABOG constructs an overlap graph from the reads and “best” overlaps. Within the graph, it builds unitigs from maximal simple paths that are free of branches and intersections. CABOG next constructs a graph of unitigs plus paired-end constraints. Within that graph, it joins unitigs into contigs and connects contigs into scaffolds. It applies a series of graph reductions including removal of transitively inferable edges. Finally, CABOG derives consensus sequences by computing multiple sequence alignments from the scaffold layouts plus the read sequences.
Two assemblers apply the OLC approach to the short reads from the Solexa and SOLiD platforms. The Edena software  discards duplicate reads and finds all perfect, error-free, overlaps. It removes individual overlaps that are redundant with pairs of other overlaps, an application of the transitive overlap reduction algorithm . Edena prunes spurs and bubbles. Edena was designed for unpaired reads of uniform length. The Shorty software  attacks the special case where a few long reads are available to act as seeds to recruit short reads and their mate pairs. Proceeding in iterations, Shorty uses contigs to seed larger contigs.
The third approach to assembly is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, whose attributes make it attractive for vast quantities of short reads. The K-mer graph does not require all-against-all overlap discovery, it does not (necessarily) store individual reads or their overlaps, and it compresses redundant sequence. Conversely, the K-mer graph does contain actual sequence and the graph can exhaust available memory on large genomes. Distributed memory approaches may alleviate this constraint.
The K-mer graph approach dates to an algorithm for Sanger read assembly  based on a proposal  for an even older sequencing technology; see  for review. The approach is commonly called a de Bruijn graph (DBG) approach or an Eulerian approach  based on an ideal scenario. Given perfect data – error-free K-mers providing full coverage and spanning every repeat – the K-mer graph would be a de Bruijn graph and it would contain an Eulerian path, that is, a path that traverses each edge exactly once. The path would be trivial to find making the assembly problem trivial by extension. Of course, K-mer graphs built from real sequencing data are more complicated.
To the extent that the data is ideal, assembly is a by-product of the graph construction. The graph construction phase proceeds quickly using a constant-time hash table lookup for the existence of each K-mer in the data stream. Although the hash table consumes extra memory, the K-mer graph itself stores each K-mer at most once, no matter how many times the K-mer occurs in the reads. In terms of computer memory, the graph is smaller than the input reads to the extent that the reads share K-mers.
Pevzner  explored problems that genomic repeats introduce. Repeats induce cycles in the K-mer graph. These would allow more than one possible reconstruction of the target. Idury and Waterman  also explored problems of real data. They added two extra types of information to the K-mer graph and named the result a sequence graph. Each edge was labeled with the reads, and positions within each read, of the sequences that induced it. Where nodes had one inbound and one outbound edge, the three elements could be compressed into one edge. This was called the elimination of singletons. Further research led to the Euler software implementation  for Sanger data. Impractical for large-scale Sanger sequencing projects, Euler and the DBG approach were well positioned when the Illumina platform started to produce data composed of very short unpaired reads of uniform size.
Three factors complicate the application of K-mer graphs to DNA sequence assembly.
The Euler software was developed for Sanger reads ; ; . It was subsequently modified for short 454 GS20 reads , even shorter unpaired Illumina/Solexa reads , and paired-end Solexa reads .
Euler applies a filter to the reads before it builds its graph. The filter detects erroneous base calls by noting low-frequency K-mers. The filter relies on redundancy of reads: most true K-mers should be repeated in several reads. The filter also relies on randomness of sequencing error: for any K where 4K exceeds twice the genome size, most erroneous K-mers should be unique. The Euler filter is implemented with a list of K-mers and their observed frequencies in the reads. The filter excludes or corrects low-frequency K-mers. Correction is especially important for short reads with high error and high coverage. Correction reduces the total number of K-mers and thus the node count in the graph. Correction risks masking true polymorphism . Correction can corrupt valid K-mers that had low coverage by chance. Correction could leave a read incorrect by settling on K-mers that each occur in several reads but never occur together in any read. OLC assemblers have an analogous base call correction step that uses overlaps rather than K-mers.
The Euler filter process is called spectral alignment . The software identifies sequencing error by comparing K-mer content between individual reads and all reads. It distrusts individual-read K-mers whose frequency in all reads is below a threshold. The threshold is chosen after calculating the distribution of K-mer frequencies present in the reads. The distribution is usually bi-modal. The first peak represents the many K-mers that occur once or twice, due to sequencing error (or low coverage). The second peak represents the redundant K-mers induced by the read coverage (or genomic repeats). Euler selects the threshold between the peaks and effectively labels all K-mers as bad or good. Euler then examines each read. For each read with bad K-mers, it executes a greedy exploration for base call substitutions that reduce the bad K-mer count . Finally, it either accepts a fully corrected read or rejects the read. (Rejected reads can be re-introduced to patch low-coverage regions after assembly.) Note Euler corrects substitution errors but not insertions or deletions, i.e. indels. Substitutions are the most common base call error type in Solexa data .
Euler builds a K-mer graph from the filtered and corrected reads. Then it applies a series of graph manipulations designed to overcome the effects of sequencing errors and genomic repeats.
By processing K-mers not reads, the K-mer graph construction discards long-range continuity information in the reads. Euler repairs this defect by threading the reads through the graph. Mapping a read onto the graph is easy. At least initially, the K-mers in reads map to unique nodes and reads are consistent with some path. Exploitation of the mapping is more complex. Reads ending inside a repeat are consistent with any path exiting the repeat, but reads spanning a repeat are consistent with fewer paths. For the latter, read threading pulls out one piece of string from a frayed rope pattern, thus resolving one copy of the collapsed repeat (Figure 4a). Thus read threading constrains the set of valid paths through the graph. This allows resolution of repeats whose length is between K and the read length.
A paired-end read is effectively a long read that is missing base calls in the middle. Euler treats paired ends this way to resolve repeats longer than individual reads. The technique could be called mate threading. Paired ends that span a repeat provide the evidence to join one path that enters a repeat to one path that exits the repeat (Figure 4b). Paired ends can also resolve some complex tangles induced by repeats. A complex graph may have multiple paths between two nodes corresponding to opposite ends of a mate pair. Each path implies a putative DNA sequence. In many cases, only one of the paths implies a sequence whose length satisfies the paired-end constraint (Figure 4c). Between any mate pair, there could be too many paths for exhaustive search to be feasible. Euler seems to restrict the search space using the mate constraint as a bound on path length. (The number of paths in a general graph scales with NE for N nodes and E out-going edges per node. K-mer graphs of DNA sequence can be constrained to E≤4 to represent the 4 possible one-letter extensions to a sequence. The constraint is violated by some graph simplifications, and N4 is still not tractable.)
After threading, Euler implements graph simplifications at regions with low and high depth of coverage in reads. Euler’s spur erosion reduces branching in graph paths and thereby lengthens simple paths. The spurs are presumed due to sequencing error that survived the spectral alignment filter. Euler identifies remaining edges that appear repetitive and removes them from the set of paths. This is equivalent to breaking contigs at repeat boundaries in OLC assembly.
Reads from many platforms contain lower quality base calls at their 3′ ends, and the problem can be exacerbated by long-read protocols on short-read platforms. Euler addresses this problem by trusting read prefixes more than their suffixes. It chooses trustable prefixes during the error correction step. Prefix length varies per read. During read threading, prefixes and suffixes can map to multiple paths. By a heuristic, Euler trusts mappings that are significantly better than their second-best choice. Just as the suffixes would add coverage to a multiple sequence alignment, they add connectivity to the graph. The extra sequence leads to greater contig size. Euler chooses not to alter the assembly consensus sequence based on the suffixes, so the mapped suffixes contribute connectivity only.
Overlap graphs are sensitive to the minimum overlap length threshold, and K-mer graphs are sensitive to the parameter K. Larger values of K resolve longer repeats but they also fracture assemblies in regions of low read coverage. Euler addresses this with a heuristic. Euler constructs and simplifies two K-mer graphs with different values of K. It identifies edges present in the smaller-K graph that are missing in the larger-K graph. It adds corresponding pseudo-edges to the second graph. The borrowed edges extend paths in the second graph and thus enlarge contigs in the assembly. This technique effectively uses large K-mers to build reliable initial contigs, and then fills gaps with more prolific small K-mers. This is analogous to gap filling approaches in OLC assemblers .
Some of the Euler software incorporates another structure called the A-Bruijn graph. It gets its name from being a combination of a de Bruijn graph and an adjacency matrix or A-matrix. Nodes of the graph represent consecutive columns in multiple sequence alignments. Compared to nodes representing K-mers in individual reads, the adjacency nodes can be less sensitive to sequencing error. The A-Bruijn graph was deployed for converting a genome sequence to a repeat graph and classifying repeats. It was proposed as a basis for assembly .
In summary, Euler compares de Bruijn graphs built from different K-mer sizes. Euler applies heuristics to mitigate graph complexity induced by sequencing error. It exploits low-quality read ends and paired-end constraints to tease apart graph tangles induced by genomic repeats. The software targets de novo assembly from short reads, including paired-ends, from the Solexa platform.
Velvet [25; 56] is a reliable and easy to use DBG assembler. Velvet makes extensive use of graph simplification to reduce simple non-intersecting paths to single nodes. Simplification compresses the graph without loss of information. Velvet invokes the simplification step during graph construction and again several times during the assembly process. The technique, introduced as elimination of singletons for K-mer graphs , is analogous to unitig formation in overlap graphs  and OLC assemblers .
Velvet prunes the K-mer graph by removing spurs iteratively. Its tip removal algorithm is similar to Euler’s erosion procedure. The spur removal drastically reduced the graph size on real data , possibly because it was the pipeline’s first attempt at filtering out base call errors. Velvet does not implement Euler’s spectral alignment filter. Velvet has a parameter for the minimum number of occurrences in the reads for a K-mer to qualify as a graph node. The Velvet publication seems to discourage use of this naïve filter.
Velvet reduces graph complexity with a bounded search for bubbles in the graph. Velvet’s tour bus algorithm uses breadth-first-search, fanning out as much as possible, starting at nodes with multiple out-going edges. Since graphs of real data can have bubbles within bubbles, an exhaustive search for all bubbles would be impractical. The search is bounded to make it tractable; the candidate paths are traversed in step, moving ahead one node on all paths per iteration, until the path lengths exceed a threshold. Velvet narrows the bubble candidates to those with a sequence similarity requirement on the alternate paths. Having found a bubble, Velvet removes the path representing fewer reads and, working outside the graph, re-aligns reads from the removed path to the remaining path. Because higher read multiplicity determines the target path, the re-aligner effectively calls the consensus bases by a column-wise voting algorithm. The operation risks “papering over” genuine sequence differences due to polymorphism in the donor DNA or over-collapse of near-identical repeats. Velvet’s algorithm is similar to bulge removal in Euler  and analogous to bubble detection and bubble smoothing in OLC assemblers .
Velvet further reduces graph complexity by read threading. It removes paths that represent fewer reads than a threshold. This operation risks removing low-coverage sequence but it is thought to remove mostly spurious connections induced by convergent sequencing errors. Velvet exploits long reads, if any were provided, by an algorithm it calls Rock Band. This forms nodes out of paths that are confirmed by two or more long reads provided no other two long reads provide a consistent contradiction.
Velvet’s final graph reduction involves mate pairs. Early versions used an algorithm called breadcrumb  that is similar to mate pair threading in DBG algorithms  and gap filling in OLC algorithms . It operated on pairs of long contigs (simple paths) connected by paired-ends. Using the long contigs as anchors, it tried to fill the gap between them with short contigs. It gathered short contigs linked to the long contigs, and applied a breadth-first search through the DBG for a single path linking the long contigs and by traversing the short contigs. Later versions of Velvet use an algorithm called pebble . In this algorithm, unique and repeat contigs substitute for breadcrumb’s long and short contigs, respectively. The unique/repeat classifier is based on read coverage per contig. Velvet uses a statistical test similar to Celera Assembler’s A-stat  and a given coverage expectation. It exploits the given (per-library) insert length distribution to build an approximate contig layout. It searches the DBG for a path that is consistent with the layout.
Velvet may be run several times per data set to optimize selection of three critical parameters. The length of the K-mers is constrained to be odd to preclude nodes representing palindromic repeats. The minimum frequency expected of K-mers in the reads determines which K-mers are pruned a-priori. The expected coverage of the genome in reads controls spurious connection breaking.
In summary, Velvet offers a full implementation of DBG assembly. It does not use an error-correction pre-processor, though it does have an error-avoidance read filter. It applies a series of heuristics that reduce graph complexity. The heuristics exploit local graph topology, read coverage, sequence identity, and paired-end constraints. The software targets de novo assembly from short reads with paired ends from the Solexa platform. An extension allows it to assemble data sets composed solely of SOLiD reads (www.solidsoftwaretools.com). Memory requirements currently preclude Velvet from assembling very large genomes.
ABySS is a distributed implementation  designed to addresses memory limitations of mammalian-size genome assembly by DBG. ABySS distributes the K-mer graph, and the graph computations, across a compute grid whose combined memory is quite large. This scheme allowed it to assemble 3.5 billion Solexa reads from a human donor.
Several problems confront any single-CPU algorithm that is ported to a grid. It must be possible to partition the problem for parallel computation, to distribute the partitions evenly across the grid, and to marshal the results. ABySS partitions the assembly at the granularity of the individual graph node. (Each graph node is processed separately. For efficiency, many graph nodes are assigned to each CPU.) The assignment of a graph node to CPU is accomplished by converting the K-mer to an integer. The formula is strand-neutral such that a K-mer and its reverse complement map to the same integer. The formula would be helpful if it somehow mapped neighbor K-mers to the same CPU. It is not clear to what extent this was accomplished.
ABySS introduces parallel implementations of graph simplifications present in Euler and Velvet. ABySS iteratively removes spurs shorter than a threshold. ABySS performs bubble smoothing by bounded search and prefers the path supported by more reads. It also transforms simple non-intersecting paths into contigs and performs mate threading. ABySS uses a compact representation of the K-mer graph. Each graph node represents a K-mer and its reverse complement. Each graph node keeps 8 bits of extra information: the existence or non-existence of each of the four possible one-letter extensions at each end. The graph edges are implicit in this extra information. ABySS follows paths in parallel starting at arbitrary graph nodes per CPU. From any node, ABySS finds its successor elegantly: the node’s last K-1 bases, plus a one-base extension indicated by an edge, is converted numerically to the address (including CPU assignment) of the successor node. When a path traverses a node on a different CPU, the process emits a request for information. Since inter-CPU communication is typically slow, the process works on other graph nodes while waiting for the response.
In a post-process, ABySS exploits paired-end reads to merge contigs. ABySS does not build scaffolds. In summary, ABYSS is scalable assembly software for Solexa short reads and paired end reads.
AllPaths uses a read-correcting pre-processor related to spectral alignment in Euler. It trusts K-mers that occur at high frequency in reads and at high quality (where each base must be confirmed by a minimum number of base calls with QV above a threshold). It retains reads whose K-mers are trusted. The filter operates on K-mers for three values of K. The filter is relaxed in two ways. Reads are restored if up to two substitutions to low-QV base calls make its K-mers trusted. K-mers are restored later if they are essential for building a path between paired-end reads.
AllPaths invokes a second pre-processor that creates “unipaths.” The process begins with the calculation of perfect read overlaps seeded by K-mers. It assigns numerical identifiers to the K-mers such that many K-mers seen consecutively in reads and overlaps receive consecutive identifiers. It populates a database of identifier intervals and the read links between them. It merges intervals to an extent that is consistent with all the reads. The operation is equivalent to DBG construction followed by elimination of singletons. The database implementation probably reduces the RAM requirement for the graph construction, which comes next.
AllPaths builds a DBG from the database. Its first graph operation is spur erosion, which it calls unitig graph shaving. AllPaths partitions the graph. Its goal is to resolve genomic repeats by assembling regions that are locally non-repetitive. It applies heuristics to choose partitions that form a tiling path across the genome. It seeds partitions with nodes corresponding to long, moderately covered, widely separated contigs. It populates partitions with nodes and reads linked by alignments and mate pairs. It seeks to fill gaps between paired-end reads by searching the K-mer graph for instances where exactly one path satisfies the distance constraint. For tighter variance on constraints, it uses short-range paired-ends first. AllPaths assembles each partition separately and in parallel. Then, AllPaths “glues” the local graphs where they have overlapping structure. This is analogous to joining contigs based on sequence overlaps. In the global graph, AllPaths heuristically removes spurs, small disconnected components, and paths not spanned by paired-ends. It unrolls cycles to match paired-end distance constraints; this decides copy number on tandem repeats. It also uses paired-ends to tease apart collapsed repeats that display the frayed rope pattern.
In summary, AllPaths targets large-genome assembly using paired-end short reads from the Solexa platform. Its read filter uses quality values to fix some substitution errors. It simplifies its K-mer graph initially based on reads and overlaps. Going beyond read and mate threading, it applies the read and paired-end data outside the graph in a divide-and-conquer approach.
SOAP filters and corrects reads using pre-set thresholds for K-mer frequencies. It builds a de Bruijn graph and erodes the tips. SOAP threads the reads and splits paths that display a symmetrical frayed rope pattern. SOAP removes bubbles with an algorithm like Velvet’s tour bus, with higher read coverage determining the surviving path. Though SOAP’s DBG implementation borrows from Euler and Velvet, its graph is more space-efficient. Devoid of read-tracking information, the graph required only 120GB RAM to store 5 billion nodes from 3.3 billion reads (after filtering). As described , the graph was constructed without data from large-insert libraries because those were suspected of generating many chimeric reads.
SOAP builds contigs from reads by the DBG method. Then it discards the de Bruijn graph to build scaffolds. It maps all paired reads to the contig consensus sequences, including reads not used in the DBG. Then it builds a contig graph whose edges represent the inter-contig mate pair constraints. SOAP reduces complexity in the contig graph by removing edges that are transitively inferable from others. It also isolates contigs traversed by multiple, incompatible paths, since these appear to be collapsed repeats. Similar techniques are implemented in CABOG and its predecessor, Celera Assembler. As does AllPaths, SOAP processes the edges in order of insert size, from small to large. SOAP does this to preclude construction of scaffolds that interleave others. SOAP uses mate pairs to assign reads to gaps between neighbor contigs within a scaffold. This is similar to CABOG’s “rocks and stones” technique [37; 62] and to Velvet’s breadcrumbs and pebble techniques. SOAP uses de Bruijn graphs to assemble the reads assigned to each gap. In summary, SOAP is a large-genome implementation that amalgamates OLC and DBG techniques from its NGS assembly predecessors.
The PCAP long-read assembler also assembles 454 reads . LOCAS targets low-coverage short-read data (www-ab.informatik.uni-tuebingen.de/software/locas). The MIRA long-read assembler also assembles short reads (chevreux.org/projects_mira.html), as does Forge . Similar to Edena, Taipan  implements greedy contig extension and transitive overlap reduction for unpaired short reads. Proprietary, commercial short-read assemblers include CLC Workbench (www.clcbio.com) and SeqMan (www.dnastart.com). Complete Genomics offers human genome sequencing services without distributing its sequencing or assembly technology (www.completegenomics.com). SHRAP is a protocol for sequencing human-scale genomes  with short reads by clone-based sequencing and a hierarchical assembly strategy. The protocol was demonstrated with Euler on simulated data. Network flow analysis has been described for resolution of repeat copy number during genome assembly [52; 66]. It was implemented with generic flow analysis software .
The SOLiD platform from ABI has an unusual characteristic that requires special attention in software. The base calls are represented using the four digits 0, 1, 2, 3. These are referred to as colors. Each color is a di-base encoding of one base in relation to its preceding base. The reads’ first base is a constant per run. Software gains robustness to error by forming alignments in color space and testing their validity by conversion to base space. Software support for de novo assembly of SOLiD reads is limited and has not been formally described. ABI provided an update to Velvet (www.solidsoftwaretools.com) that assembles homogeneous SOLiD data sets. There is de novo assembly support in commercial software such as CLC Workbench (www.clcbio.com). Some packages that assemble SOLiD reads in the presence of Sanger reads, 454 reads, or contigs include Shorty , SeqWrite (www.seqwright.com) and NextGENe (www.softgenetics.com).
An alternative to de novo assembly is mapping. For some applications, sufficient information can be extracted from the mapping of reads to a reference sequence, such as a finished genome from a related individual. Mapped reads can reveal small-scale population differences such as substitutions and indels. (These are routinely referred to as single-nucleotide substitutions, or SNPs, and deletion-insertion polymorphisms, or DIPs, respectively.) Mapped mate pairs can reveal larger indels and structural re-arrangements. Using a mapping, it is possible to construct contigs and scaffolds, each with their consensus sequences, as in de novo assembly.
Short-read mapping software is widely available. Published software includes SOAP [68; 69], MAQ , Bowtie , RMAP , CloudBurst , SHRiMP , RazerS , PerM , segemehl , GenomeMapper , and BOAT . Platform-specific mappers are available from platform vendors: Eland from Illumina (www.illumina.com), Corona from ABI , and Reference Mapper from 454 (www.454.com). Other commercial or unpublished solutions include ZOOM , CLC Workbench (www.clcbio.com), Novoalign (www.novocraft.com), Myrialign (savannah.nongnu.org/projects/myrialign) and Mr. Fast (mrfast.sf.net). Mapping-based variant discovery algorithms include MoDIL , VariationHunter  and BreakDancer . Somewhere between de novo and mapping assembly there is reference-guided assembly. Software implementations include AMOScmp  as revised for short reads, MOSAIK as featured in a short-read, cross-species comparison , and SeqMan which is commercial software (www.dnastar.com). A related approach, called gene-boosted assembly, finishes an assembly by comparison of related genomes and protein sequences . Our survey is no doubt incomplete and we apologize for omissions.
We have compared a dozen algorithms for the de novo assembly of whole-genome shotgun (WGS) data from next-generation sequencing (NGS) platforms. Table 1 offers a summary feature comparison. Our comparison is an interpretation of the primary literature. Few of the published descriptions include measures of the relative contributions by their various algorithmic features, and none include the unit tests and controls that might verify that the implementations follow the algorithms. No algorithm or implementation solves the WGS assembly problem. Each of the various software packages was published with claims about its own superiority. Each package seems to improve in subsequent software releases. In our experience, the success of an assembler depends largely on the sophistication of its heuristics for real reads including error, real genomes including repeats, and the limitations of modern computers.
Large genomes tend to present more complex repeat structures. Most of the NGS assemblers discussed here were initially published with assemblies of bacterial-length genomes (or larger genomes by simulated reads). Four de novo assemblies of mammalian-scale genomes from Solexa short reads have been described in the scientific literature. The first assembly of the human genome from short reads  represented a software engineering feat. Later assemblies of panda  and two humans  used slightly larger reads and more variety in pair insert sizes and generated larger contigs. Improved mammalian assemblies are certain to become reality thanks to decreases in sequencing cost, increasing read lengths by every platform, more variety in paired-end protocols, continual improvement to assembly software, and the application of more powerful compute nodes and grids.
In the graph context, optimal path discovery is impractical because there can be an exponential number of paths between any source and sink node. Each assembler searches the graph but each constrains its searches for reasons of scale. A search in CABOG’s overlap graph follows only one outward edge per node. Searches in AllPaths and Velvet examine local regions of the K-mer graph that are anchored by large contigs and populated by direct and transitive paired-end linkage. Euler and ABySS seem to rely on branch-and-bound search algorithms.
Common to most assemblers, a core set of features is apparent:
OLC and DBG are two robust approaches to assembly. Both rely to some extent on a set of overlaps between the input reads. Both represent the overlaps in a directed graph. Their different graph representations are similar if not equivalent . The OLC approach directly incorporates connections (overlaps) of varying length, as expected in long-read, low-coverage data. K-mer graphs are limited initially to short connections (shared K-mers) of uniform size, though read threading mitigates this. The DBG approach is more appropriate for the large volumes of reads associated with short-read sequencing. DBG avoids the computationally expensive all-against-all pair-wise read comparisons. DBG avoids loading all the replicate sequences associated with high-coverage sequencing.
Both techniques must contend with noisy data. Sequencing error induces false positive and false negative overlaps. Noise filtration is inherent in overlap graphs constructed from non-identical alignments. K-mer graphs are more sensitive to sequencing error, as every miscalled base introduces up to K erroneous nodes. The OLC and DBG approaches both employ pre-processing steps to filter or correct unconfirmed portions of reads, as well as post-processes to repair graphs by erosion, smoothing, and threading.
The OLC assemblers grew up on long reads. The DBG assemblers proliferated with the introduction of short reads. The OLC assemblers target variable-length reads in the 100–800 bp range. The DBG assemblers target uniformly sized reads in the 25–100 bp range. Reads of the near future may be intermediate-sized, matching more closely the expectations of OLC assemblers. Further ahead, very long reads may become feasible while short reads may become even more affordable.
Reads of the future will challenge assembly software in many ways. Sequencing platforms may reveal inter-read associations richer than the paired-end model allows. Future platforms may target error rates that are higher or lower than today’s standards. Almost certainly, data volume will continue to increase while manufacturing cost declines. The next-generation technology will surely be applied to larger genomes, more repetitive sequences, and less homogeneous samples. Developers of assembly algorithms will continue to be challenged by novel applications and larger data sets. Simultaneously, assembly developers will be challenged to extract more useful information from each sequence to enable lower coverage per genome and thus higher yield. The quest for more powerful and efficient assembly software remains an area of critical research.
The authors receive funding for assembly research from the National Institutes of Health via grant 2R01GM077117-04A1 from the National Institute of General Medical Sciences.
Jason Rafe Miller manages software research and development on whole-genome shotgun assembly at the J. Craig Venter Institute (JCVI). He previously contributed to assembly infrastructure development at TIGR, genome comparison and annotation software at Celera Genomics, and genomics visualization software at GlaxoSmithKline. Mr. Miller received a Master’s degree from University of Pennsylvania and a Bachelor’s degree from New York University.
Sergey Koren, a software engineer at the J. Craig Venter Institute, is a Ph.D. student at the University of Maryland, College Park, where he received his Master of Science and Bachelor of Science (cum laude, with honors) degrees in Computer Science. His research interests include genome assembly, application of graph analysis to metagenomics, and applications of high-performance computing. He is a contributor to the Celera Assembler, AMOS, and k-mer Tools projects hosted on SourceForge.
Dr. Granger Sutton is an informatics researcher at the J. Craig Venter Institute (JCVI). Prior to joining JCVI, Dr. Sutton was a director in the Informatics Research department at Celera Genomics where he developed and managed research programs in gene finding, comparative genomics, and shotgun fragment assembly including the development of the Celera Assembler for assembling the human genome. As Computer Scientist at The Institute for Genomic Research (TIGR), he developed protein homology search, multiple sequence alignment, and shotgun fragment assembly algorithms. Dr. Sutton also worked at AT&T Bell Labs to design and implement office automation software. Dr. Sutton earned his Bachelor’s degree in electrical engineering and Doctorate in Computer Science from University of Maryland, College Park, and a Master’s degree in Computer Engineering from Stanford University.
CONFLICT OF INTEREST: None.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.