The third approach to assembly is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, whose attributes make it attractive for vast quantities of short reads. The K-mer graph does not require all-against-all overlap discovery, it does not (necessarily) store individual reads or their overlaps, and it compresses redundant sequence. Conversely, the K-mer graph does contain actual sequence and the graph can exhaust available memory on large genomes. Distributed memory approaches may alleviate this constraint.
The K-mer graph approach dates to an algorithm for Sanger read assembly [24
] based on a proposal [49
] for an even older sequencing technology; see [3
] for review. The approach is commonly called a de Bruijn graph (DBG) approach or an Eulerian approach [50
] based on an ideal scenario. Given perfect data – error-free K-mers providing full coverage and spanning every repeat – the K-mer graph would be a de Bruijn graph and it would contain an Eulerian path, that is, a path that traverses each edge exactly once. The path would be trivial to find making the assembly problem trivial by extension. Of course, K-mer graphs built from real sequencing data are more complicated.
To the extent that the data is ideal, assembly is a by-product of the graph construction. The graph construction phase proceeds quickly using a constant-time hash table lookup for the existence of each K-mer in the data stream. Although the hash table consumes extra memory, the K-mer graph itself stores each K-mer at most once, no matter how many times the K-mer occurs in the reads. In terms of computer memory, the graph is smaller than the input reads to the extent that the reads share K-mers.
] explored problems that genomic repeats introduce. Repeats induce cycles in the K-mer graph. These would allow more than one possible reconstruction of the target. Idury and Waterman [24
] also explored problems of real data. They added two extra types of information to the K-mer graph and named the result a sequence graph. Each edge was labeled with the reads, and positions within each read, of the sequences that induced it. Where nodes had one inbound and one outbound edge, the three elements could be compressed into one edge. This was called the elimination of singletons. Further research led to the Euler software implementation [50
] for Sanger data. Impractical for large-scale Sanger sequencing projects, Euler and the DBG approach were well positioned when the Illumina platform started to produce data composed of very short unpaired reads of uniform size.
Three factors complicate the application of K-mer graphs to DNA sequence assembly.
- DNA is double stranded. The forward sequence of any given read may overlap the forward or reverse complement sequence of other reads. One K-mer graph implementation contains nodes and edges for both strands, taking care to avoid output of the entire assembly twice . Another implementation stores forward and reverse sequence together as cognate half-nodes with the constraint that paths enter and exit the same half . Yet another implementation represents alternate strands in a single node with two sides, constraining paths to enter and exit opposite sides .
- Real genomes present complex repeat structures including tandem repeats, inverted repeats, imperfect repeats, and repeats inserted within repeats. Repeats longer than K lead to tangled K-mer graphs that complicate the assembly problem. Perfect repeats of length K or greater collapse inside the graph, leaving a local graph structure that resembles a rope with frayed ends (); paths converge for the length of the repeat and then they diverge. Successful assembly requires separation of the converged path, which represents a collapsed repeat. The graph contains insufficient information to disambiguate the repeat. Assemblers typically consult the reads, and possibly the mate pairs, in attempts to resolve these regions.
- A palindrome is a DNA sequence that is its own reverse complement. Palindromes induce paths that fold back on themselves. At least one assembler avoids these elegantly; Velvet  requires K, the length of a K-mer, to be odd. An odd-size K-mer cannot match its reverse complement.
- Real data includes sequencing error. DBG assemblers use several techniques to reduce sensitivity to this problem. First, they pre-process the reads to remove error. Second, they weight the graph edges by the number of reads that support them, and then “erode” the lightly supported paths. Third, they convert paths to sequences and use sequence alignment algorithms to collapse nearly identical paths. Many of these techniques derive from the Euler family of assemblers.
5.1. The de Bruijn Graph in Euler
The Euler software was developed for Sanger reads [50
]. It was subsequently modified for short 454 GS20 reads [53
], even shorter unpaired Illumina/Solexa reads [54
], and paired-end Solexa reads [55
Euler applies a filter to the reads before it builds its graph. The filter detects erroneous base calls by noting low-frequency K-mers. The filter relies on redundancy of reads: most true K-mers should be repeated in several reads. The filter also relies on randomness of sequencing error: for any K where 4K
exceeds twice the genome size, most erroneous K-mers should be unique. The Euler filter is implemented with a list of K-mers and their observed frequencies in the reads. The filter excludes or corrects low-frequency K-mers. Correction is especially important for short reads with high error and high coverage. Correction reduces the total number of K-mers and thus the node count in the graph. Correction risks masking true polymorphism [50
]. Correction can corrupt valid K-mers that had low coverage by chance. Correction could leave a read incorrect by settling on K-mers that each occur in several reads but never occur together in any read. OLC assemblers have an analogous base call correction step that uses overlaps rather than K-mers.
The Euler filter process is called spectral alignment [50
]. The software identifies sequencing error by comparing K-mer content between individual reads and all reads. It distrusts individual-read K-mers whose frequency in all reads is below a threshold. The threshold is chosen after calculating the distribution of K-mer frequencies present in the reads. The distribution is usually bi-modal. The first peak represents the many K-mers that occur once or twice, due to sequencing error (or low coverage). The second peak represents the redundant K-mers induced by the read coverage (or genomic repeats). Euler selects the threshold between the peaks and effectively labels all K-mers as bad or good. Euler then examines each read. For each read with bad K-mers, it executes a greedy exploration for base call substitutions that reduce the bad K-mer count [55
]. Finally, it either accepts a fully corrected read or rejects the read. (Rejected reads can be re-introduced to patch low-coverage regions after assembly.) Note Euler corrects substitution errors but not insertions or deletions, i.e. indels. Substitutions are the most common base call error type in Solexa data [12
Euler builds a K-mer graph from the filtered and corrected reads. Then it applies a series of graph manipulations designed to overcome the effects of sequencing errors and genomic repeats.
By processing K-mers not reads, the K-mer graph construction discards long-range continuity information in the reads. Euler repairs this defect by threading the reads through the graph. Mapping a read onto the graph is easy. At least initially, the K-mers in reads map to unique nodes and reads are consistent with some path. Exploitation of the mapping is more complex. Reads ending inside a repeat are consistent with any path exiting the repeat, but reads spanning a repeat are consistent with fewer paths. For the latter, read threading pulls out one piece of string from a frayed rope pattern, thus resolving one copy of the collapsed repeat (). Thus read threading constrains the set of valid paths through the graph. This allows resolution of repeats whose length is between K and the read length.
Figure 4 Three methods to resolve graph complexity. (a) Read threading joins paths across collapsed repeats that are shorter than the read lengths. (b) Mate threading joins paths across collapsed repeats that are shorter than the paired-end distances. (c) Path (more ...)
A paired-end read is effectively a long read that is missing base calls in the middle. Euler treats paired ends this way to resolve repeats longer than individual reads. The technique could be called mate threading. Paired ends that span a repeat provide the evidence to join one path that enters a repeat to one path that exits the repeat (). Paired ends can also resolve some complex tangles induced by repeats. A complex graph may have multiple paths between two nodes corresponding to opposite ends of a mate pair. Each path implies a putative DNA sequence. In many cases, only one of the paths implies a sequence whose length satisfies the paired-end constraint (). Between any mate pair, there could be too many paths for exhaustive search to be feasible. Euler seems to restrict the search space using the mate constraint as a bound on path length. (The number of paths in a general graph scales with NE for N nodes and E out-going edges per node. K-mer graphs of DNA sequence can be constrained to E≤4 to represent the 4 possible one-letter extensions to a sequence. The constraint is violated by some graph simplifications, and N4 is still not tractable.)
After threading, Euler implements graph simplifications at regions with low and high depth of coverage in reads. Euler’s spur erosion reduces branching in graph paths and thereby lengthens simple paths. The spurs are presumed due to sequencing error that survived the spectral alignment filter. Euler identifies remaining edges that appear repetitive and removes them from the set of paths. This is equivalent to breaking contigs at repeat boundaries in OLC assembly.
Reads from many platforms contain lower quality base calls at their 3′ ends, and the problem can be exacerbated by long-read protocols on short-read platforms. Euler addresses this problem by trusting read prefixes more than their suffixes. It chooses trustable prefixes during the error correction step. Prefix length varies per read. During read threading, prefixes and suffixes can map to multiple paths. By a heuristic, Euler trusts mappings that are significantly better than their second-best choice. Just as the suffixes would add coverage to a multiple sequence alignment, they add connectivity to the graph. The extra sequence leads to greater contig size. Euler chooses not to alter the assembly consensus sequence based on the suffixes, so the mapped suffixes contribute connectivity only.
Overlap graphs are sensitive to the minimum overlap length threshold, and K-mer graphs are sensitive to the parameter K. Larger values of K resolve longer repeats but they also fracture assemblies in regions of low read coverage. Euler addresses this with a heuristic. Euler constructs and simplifies two K-mer graphs with different values of K. It identifies edges present in the smaller-K graph that are missing in the larger-K graph. It adds corresponding pseudo-edges to the second graph. The borrowed edges extend paths in the second graph and thus enlarge contigs in the assembly. This technique effectively uses large K-mers to build reliable initial contigs, and then fills gaps with more prolific small K-mers. This is analogous to gap filling approaches in OLC assemblers [37
Some of the Euler software incorporates another structure called the A-Bruijn graph. It gets its name from being a combination of a de Bruijn graph and an adjacency matrix or A-matrix. Nodes of the graph represent consecutive columns in multiple sequence alignments. Compared to nodes representing K-mers in individual reads, the adjacency nodes can be less sensitive to sequencing error. The A-Bruijn graph was deployed for converting a genome sequence to a repeat graph and classifying repeats. It was proposed as a basis for assembly [26
In summary, Euler compares de Bruijn graphs built from different K-mer sizes. Euler applies heuristics to mitigate graph complexity induced by sequencing error. It exploits low-quality read ends and paired-end constraints to tease apart graph tangles induced by genomic repeats. The software targets de novo assembly from short reads, including paired-ends, from the Solexa platform.
5.2. The de Bruijn Graph in Velvet
] is a reliable and easy to use DBG assembler. Velvet makes extensive use of graph simplification to reduce simple non-intersecting paths to single nodes. Simplification compresses the graph without loss of information. Velvet invokes the simplification step during graph construction and again several times during the assembly process. The technique, introduced as elimination of singletons for K-mer graphs [24
], is analogous to unitig formation in overlap graphs [23
] and OLC assemblers [37
Velvet prunes the K-mer graph by removing spurs iteratively. Its tip removal algorithm is similar to Euler’s erosion procedure. The spur removal drastically reduced the graph size on real data [25
], possibly because it was the pipeline’s first attempt at filtering out base call errors. Velvet does not implement Euler’s spectral alignment filter. Velvet has a parameter for the minimum number of occurrences in the reads for a K-mer to qualify as a graph node. The Velvet publication seems to discourage use of this naïve filter.
Velvet reduces graph complexity with a bounded search for bubbles in the graph. Velvet’s tour bus algorithm uses breadth-first-search, fanning out as much as possible, starting at nodes with multiple out-going edges. Since graphs of real data can have bubbles within bubbles, an exhaustive search for all bubbles would be impractical. The search is bounded to make it tractable; the candidate paths are traversed in step, moving ahead one node on all paths per iteration, until the path lengths exceed a threshold. Velvet narrows the bubble candidates to those with a sequence similarity requirement on the alternate paths. Having found a bubble, Velvet removes the path representing fewer reads and, working outside the graph, re-aligns reads from the removed path to the remaining path. Because higher read multiplicity determines the target path, the re-aligner effectively calls the consensus bases by a column-wise voting algorithm. The operation risks “papering over” genuine sequence differences due to polymorphism in the donor DNA or over-collapse of near-identical repeats. Velvet’s algorithm is similar to bulge removal in Euler [26
] and analogous to bubble detection and bubble smoothing in OLC assemblers [28
Velvet further reduces graph complexity by read threading. It removes paths that represent fewer reads than a threshold. This operation risks removing low-coverage sequence but it is thought to remove mostly spurious connections induced by convergent sequencing errors. Velvet exploits long reads, if any were provided, by an algorithm it calls Rock Band. This forms nodes out of paths that are confirmed by two or more long reads provided no other two long reads provide a consistent contradiction.
Velvet’s final graph reduction involves mate pairs. Early versions used an algorithm called breadcrumb [25
] that is similar to mate pair threading in DBG algorithms [52
] and gap filling in OLC algorithms [37
]. It operated on pairs of long contigs (simple paths) connected by paired-ends. Using the long contigs as anchors, it tried to fill the gap between them with short contigs. It gathered short contigs linked to the long contigs, and applied a breadth-first search through the DBG for a single path linking the long contigs and by traversing the short contigs. Later versions of Velvet use an algorithm called pebble [56
]. In this algorithm, unique and repeat contigs substitute for breadcrumb’s long and short contigs, respectively. The unique/repeat classifier is based on read coverage per contig. Velvet uses a statistical test similar to Celera Assembler’s A-stat [37
] and a given coverage expectation. It exploits the given (per-library) insert length distribution to build an approximate contig layout. It searches the DBG for a path that is consistent with the layout.
Velvet may be run several times per data set to optimize selection of three critical parameters. The length of the K-mers is constrained to be odd to preclude nodes representing palindromic repeats. The minimum frequency expected of K-mers in the reads determines which K-mers are pruned a-priori. The expected coverage of the genome in reads controls spurious connection breaking.
In summary, Velvet offers a full implementation of DBG assembly. It does not use an error-correction pre-processor, though it does have an error-avoidance read filter. It applies a series of heuristics that reduce graph complexity. The heuristics exploit local graph topology, read coverage, sequence identity, and paired-end constraints. The software targets de novo assembly from short reads with paired ends from the Solexa platform. An extension allows it to assemble data sets composed solely of SOLiD reads (www.solidsoftwaretools.com
). Memory requirements currently preclude Velvet from assembling very large genomes.
5.3. The de Bruijn Graph in ABySS
ABySS is a distributed implementation [51
] designed to addresses memory limitations of mammalian-size genome assembly by DBG. ABySS distributes the K-mer graph, and the graph computations, across a compute grid whose combined memory is quite large. This scheme allowed it to assemble 3.5 billion Solexa reads from a human donor.
Several problems confront any single-CPU algorithm that is ported to a grid. It must be possible to partition the problem for parallel computation, to distribute the partitions evenly across the grid, and to marshal the results. ABySS partitions the assembly at the granularity of the individual graph node. (Each graph node is processed separately. For efficiency, many graph nodes are assigned to each CPU.) The assignment of a graph node to CPU is accomplished by converting the K-mer to an integer. The formula is strand-neutral such that a K-mer and its reverse complement map to the same integer. The formula would be helpful if it somehow mapped neighbor K-mers to the same CPU. It is not clear to what extent this was accomplished.
ABySS introduces parallel implementations of graph simplifications present in Euler and Velvet. ABySS iteratively removes spurs shorter than a threshold. ABySS performs bubble smoothing by bounded search and prefers the path supported by more reads. It also transforms simple non-intersecting paths into contigs and performs mate threading. ABySS uses a compact representation of the K-mer graph. Each graph node represents a K-mer and its reverse complement. Each graph node keeps 8 bits of extra information: the existence or non-existence of each of the four possible one-letter extensions at each end. The graph edges are implicit in this extra information. ABySS follows paths in parallel starting at arbitrary graph nodes per CPU. From any node, ABySS finds its successor elegantly: the node’s last K-1 bases, plus a one-base extension indicated by an edge, is converted numerically to the address (including CPU assignment) of the successor node. When a path traverses a node on a different CPU, the process emits a request for information. Since inter-CPU communication is typically slow, the process works on other graph nodes while waiting for the response.
In a post-process, ABySS exploits paired-end reads to merge contigs. ABySS does not build scaffolds. In summary, ABYSS is scalable assembly software for Solexa short reads and paired end reads.
5.4. The de Bruijn Graph in AllPaths
AllPaths is a DBG assembler intended for application to large genomes. It was published with results on simulated data [57
] and revised for real data [58
AllPaths uses a read-correcting pre-processor related to spectral alignment in Euler. It trusts K-mers that occur at high frequency in reads and at high quality (where each base must be confirmed by a minimum number of base calls with QV above a threshold). It retains reads whose K-mers are trusted. The filter operates on K-mers for three values of K. The filter is relaxed in two ways. Reads are restored if up to two substitutions to low-QV base calls make its K-mers trusted. K-mers are restored later if they are essential for building a path between paired-end reads.
AllPaths invokes a second pre-processor that creates “unipaths.” The process begins with the calculation of perfect read overlaps seeded by K-mers. It assigns numerical identifiers to the K-mers such that many K-mers seen consecutively in reads and overlaps receive consecutive identifiers. It populates a database of identifier intervals and the read links between them. It merges intervals to an extent that is consistent with all the reads. The operation is equivalent to DBG construction followed by elimination of singletons. The database implementation probably reduces the RAM requirement for the graph construction, which comes next.
AllPaths builds a DBG from the database. Its first graph operation is spur erosion, which it calls unitig graph shaving. AllPaths partitions the graph. Its goal is to resolve genomic repeats by assembling regions that are locally non-repetitive. It applies heuristics to choose partitions that form a tiling path across the genome. It seeds partitions with nodes corresponding to long, moderately covered, widely separated contigs. It populates partitions with nodes and reads linked by alignments and mate pairs. It seeks to fill gaps between paired-end reads by searching the K-mer graph for instances where exactly one path satisfies the distance constraint. For tighter variance on constraints, it uses short-range paired-ends first. AllPaths assembles each partition separately and in parallel. Then, AllPaths “glues” the local graphs where they have overlapping structure. This is analogous to joining contigs based on sequence overlaps. In the global graph, AllPaths heuristically removes spurs, small disconnected components, and paths not spanned by paired-ends. It unrolls cycles to match paired-end distance constraints; this decides copy number on tandem repeats. It also uses paired-ends to tease apart collapsed repeats that display the frayed rope pattern.
In summary, AllPaths targets large-genome assembly using paired-end short reads from the Solexa platform. Its read filter uses quality values to fix some substitution errors. It simplifies its K-mer graph initially based on reads and overlaps. Going beyond read and mate threading, it applies the read and paired-end data outside the graph in a divide-and-conquer approach.
5.5. The de Bruijn Graph in SOAPdenovo
The SOAPdenovo program [59
] generated high-quality mammalian genome sequences from billions of Solexa reads of length 75bp and less [60
]. The program is freely available but without source code.
SOAP filters and corrects reads using pre-set thresholds for K-mer frequencies. It builds a de Bruijn graph and erodes the tips. SOAP threads the reads and splits paths that display a symmetrical frayed rope pattern. SOAP removes bubbles with an algorithm like Velvet’s tour bus, with higher read coverage determining the surviving path. Though SOAP’s DBG implementation borrows from Euler and Velvet, its graph is more space-efficient. Devoid of read-tracking information, the graph required only 120GB RAM to store 5 billion nodes from 3.3 billion reads (after filtering). As described [59
], the graph was constructed without data from large-insert libraries because those were suspected of generating many chimeric reads.
SOAP builds contigs from reads by the DBG method. Then it discards the de Bruijn graph to build scaffolds. It maps all paired reads to the contig consensus sequences, including reads not used in the DBG. Then it builds a contig graph whose edges represent the inter-contig mate pair constraints. SOAP reduces complexity in the contig graph by removing edges that are transitively inferable from others. It also isolates contigs traversed by multiple, incompatible paths, since these appear to be collapsed repeats. Similar techniques are implemented in CABOG and its predecessor, Celera Assembler. As does AllPaths, SOAP processes the edges in order of insert size, from small to large. SOAP does this to preclude construction of scaffolds that interleave others. SOAP uses mate pairs to assign reads to gaps between neighbor contigs within a scaffold. This is similar to CABOG’s “rocks and stones” technique [37
] and to Velvet’s breadcrumbs and pebble techniques. SOAP uses de Bruijn graphs to assemble the reads assigned to each gap. In summary, SOAP is a large-genome implementation that amalgamates OLC and DBG techniques from its NGS assembly predecessors.