When discussing genome assembly it is important to distinguish between de novo approaches, aimed at reconstructing genomes that are not similar to any previously sequenced organism, and comparative (re-sequencing) approaches, which use the sequence of a closely related organism as a guide during the assembly process. Mathematically, the de novo genome assembly problem can be proven to be difficult, falling within a class of problems (NP-hard) for which no efficient computational solution is known [20]. In contrast, comparative assembly is a much easier task—it is essentially sufficient to align the set of reads to the reference genome in order to characterize a newly sequenced organism. The two approaches are not exclusive—even if a reference genome is available, regions of the sequenced genome that differ significantly from the reference (e.g. large insertions) can only be reconstructed through de novo assembly. The characteristics of the sequencing technology being used also restrict the choice of assembly strategy. Short reads and the absence of mate-pairs make it difficult for de novo assemblers to resolve repeats, while the large number of reads generated by new-generation sequencing machines leads to efficiency issues. Due to these challenges, short-read sequencing technologies (Solexa, SOLiD) have primarily been used in re-sequencing applications. De novo assembly of these data has largely been restricted to bacterial genomes, though an assembly of an entire human genome from Solexa reads was recently reported [22]. Irrespective of genome size, de novo assemblies constructed from short-read data are highly fragmented [22]. This approach is, therefore, better suited for long reads or for combinations of data from multiple sequencing technologies [24].
De novo methods, however, are essential to our efforts to characterize the biological diversity in our world. Comparative approaches can only be applied to the few genomes for which reference sequences are available. In this section we will primarily focus our discussion on the main strategies for de novo assembly—greedy, overlap-layout-consensus and Eulerian—and conclude with a brief overview of comparative approaches.
A key component of the greedy and overlap-layout-consensus (OLC) strategies is a module, called an overlapper, that computes all pairwise alignments between a set of reads. The computation of overlaps is one of the most time-intensive components of an assembly program, requiring, in the worst case, time proportional to the square of the number of reads provided to the assembler (each read must be compared to all other reads, leading to O(n²) behavior). In most practical cases the performance is much better—running times roughly proportional to the number of reads can be achieved through simple indexing strategies. One example is an index based on exact matches of length k (k-mers)—only the reads sharing the same k-mer need to be compared when computing overlaps. Note that quadratic behavior can still be observed in repeat regions—all reads sharing the same k-mer must be compared to each other. For this reason, many assemblers exclude from the index k-mers that appear repetitive, i.e. those contained in more reads than expected given the sequencing coverage.
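The indexing idea can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any particular assembler: the reads, the value of k and the `max_occ` repeat cutoff are hypothetical, and a real overlapper would follow the candidate step with a full alignment of each pair.

```python
from collections import defaultdict

def candidate_pairs(reads, k=12, max_occ=50):
    """Index reads by their k-mers; only reads sharing a k-mer
    become candidates for full overlap alignment."""
    index = defaultdict(set)          # k-mer -> set of read ids
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for kmer, rids in index.items():
        if len(rids) > max_occ:       # skip apparently repetitive k-mers
            continue
        for a in rids:
            for b in rids:
                if a < b:
                    pairs.add((a, b))
    return pairs

reads = ["ACGTACGTGGAT", "ACGTGGATTTCA", "TTTTTTTTTTTT"]
print(candidate_pairs(reads, k=8))    # only the first two reads share an 8-mer
```

Only pairs sharing a sufficiently rare k-mer survive, which is what reduces the quadratic all-against-all comparison to roughly linear work in practice.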
Overlap computation can be easily parallelized, allowing the running time to be reduced through the use of multi-processor machines (most modern desktops contain several processors) or computational grid resources (available at most major research institutions). It is important to note that the use of short-read sequencing technologies has a significant impact on the complexity of overlap computation, in particular due to the increased number of reads generated in a short-read sequencing project—such projects generate an order of magnitude more reads than were commonly generated in Sanger-based projects.
Greedy algorithms represent the simplest, most intuitive, solution to the assembly problem. Individual reads are joined together into contigs in an iterative fashion, starting with the reads that overlap best, and ending once no more reads or contigs can be joined. By overlap we mean that the prefix of one of the reads shares sufficient similarity with the suffix of another read (A). The definition of overlap quality depends on the implementation, commonly used measures being the length of the overlap and the level of identity (percentage of base pairs shared by the two reads) within the overlapping region. The term ‘greedy’ refers to the fact that the decisions made by the algorithm optimize a local objective function (in the case of assembly, the quality of the overlap between two reads), an approach that may not lead to a globally optimal solution. For example, by always processing the best overlap first, a greedy assembler may misassemble repeats (B and C).
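The greedy iteration can be illustrated with a toy merger that always joins the pair with the longest suffix-prefix overlap. This is a simplified sketch invented for illustration: overlaps are exact matches only, reverse complements are ignored, and the `min_len` threshold is hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-l:] == b[:l]:
            return l
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the best
    (here: longest) overlap -- a locally optimal choice."""
    seqs = list(reads)
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    l = overlap(a, b, min_len)
                    if l > best[0]:
                        best = (l, i, j)
        l, i, j = best
        if i is None:                 # no remaining overlaps
            break
        merged = seqs[i] + seqs[j][l:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]
    return seqs

print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGTT"]))  # -> ['ACGTACGGTT']
```

Because each merge is chosen on local overlap quality alone, a repeat longer than the overlap length can lure this procedure into exactly the misassemblies described above.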
Figure 1: (A) Overlap between two reads—note that agreement within the overlapping region need not be perfect; (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Assembly produced by the greedy approach. Reads A and …
The variant of the greedy approach outlined above was used by several of the most widely used genome assemblers for Sanger data, such as phrap, TIGR Assembler and CAP3. Recent software aimed at assembling short-read sequencing data (SSAKE, VCAKE and SHARCGS) uses a different greedy strategy. An unassembled read is chosen to start a contig, which is then repeatedly extended by identifying reads that overlap the contig on its 3′ end, until no more extensions are possible. The process is repeated in the 5′ direction using the reverse complement of the contig sequence. The assembly continues in an iterative fashion by scanning through the unassembled reads. The reads are considered in decreasing order of their quality, as defined either by depth of coverage (number of reads confirming a section of a given read) [25] or by a combination of coverage, quality values and the presence of at least one perfect overlap with another read [26]. To avoid misassemblies, the extension process is terminated once conflicting information is found, i.e. when two or more reads could extend the contig but do not overlap each other (D).
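A minimal sketch of this extension loop, much simplified from the cited tools: overlaps are exact matches only, extension proceeds one base at a time in the 3′ direction only, the `min_ov` threshold is hypothetical, and conflict detection is reduced to overlapping reads disagreeing on the next base.

```python
def extend_contig(seed, reads, min_ov=4):
    """Grow a contig from a seed read on its 3' end, one base at a
    time (a simplified SSAKE/VCAKE-style extension). Extension stops
    when no read overlaps the contig end, or when overlapping reads
    disagree on the next base (conflicting information)."""
    contig = seed
    others = [r for r in reads if r != seed]
    while True:
        next_bases = set()
        for r in others:
            # longest suffix of the contig matching a prefix of r
            for l in range(min(len(contig), len(r)) - 1, min_ov - 1, -1):
                if len(r) > l and contig.endswith(r[:l]):
                    next_bases.add(r[l])
                    break
        if len(next_bases) != 1:      # no extension, or a conflict
            break
        contig += next_bases.pop()
    return contig

print(extend_contig("ACGTAC", ["ACGTAC", "GTACGG", "ACGGTT"]))  # -> ACGTACGGTT
```

A full implementation would also walk the 5′ direction on the reverse complement and mark reads as assembled once consumed, as the text describes.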
The OLC strategy breaks down the assembly into three distinct steps in order to enable a global analysis of the relationships between the reads, unlike the inherently local approach taken by greedy algorithms. The first step is the same as for the greedy approach—the reads are compared to each other to construct a list of pairwise overlaps. This information is then used to construct an overlap graph—a graph containing each read as a node, with an edge connecting two nodes if an overlap was identified between the corresponding reads. During the layout stage, the overlap graph is analyzed in order to identify paths through the graph that correspond to segments of the genome being assembled. The ultimate goal is a single path that traverses each node in the overlap graph exactly once, corresponding to a reconstruction of the genome using all the sequencing reads provided as input to the assembler. Note that in the general case identifying such a path (called a Hamiltonian path) is computationally difficult, falling in the same class of NP-hard problems as the general de novo assembly problem. Nevertheless, the graph formulation for assembly enables several analyses not possible in the greedy setting. For example, the genome represented in the overlap graph shown in Figure 2 can be inferred to contain two copies of segment B separated by a copy of segment C (the genome ‘reads’ ABCBD). The greedy strategies described above would either construct a fragmented assembly that breaks at the boundary of the repeat B (a possible reconstruction is AB, C, D) or incorrectly traverse the repeat, leading to a misassembly, i.e. the reconstruction of a DNA segment not actually present in the genome being assembled (e.g. ABD, C). The final step in the OLC strategy is consensus computation—determining the DNA sequence implied by the arrangement of reads along the chosen path through the graph. This step is performed by allowing the reads that overlap the same base within the reconstructed genome to vote on the identity of that base, the votes being weighted by sequence quality values [27].
Figure 2: Overlap graph of a genome containing a two-copy repeat (B). Note the increased depth of coverage within the repeat. The correct reconstruction of this genome spells the sequence ABCBD, while conservative assembly approaches would lead to a fragmented …
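The quality-weighted voting behind consensus computation can be illustrated as follows. The layout encoding (tuples of offset, sequence and per-base quality values) is a simplification invented for this sketch; real consensus modules work from multiple alignments and handle insertions and deletions as well.

```python
from collections import defaultdict

def consensus(placed_reads):
    """Consensus by weighted voting: each read votes for the base it
    places at a genome position, weighted by that base's quality value.
    placed_reads: list of (offset, sequence, qualities)."""
    votes = defaultdict(lambda: defaultdict(float))
    for offset, seq, quals in placed_reads:
        for i, (base, q) in enumerate(zip(seq, quals)):
            votes[offset + i][base] += q
    return "".join(max(votes[p], key=votes[p].get) for p in sorted(votes))

layout = [
    (0, "ACGT", [30, 30, 30, 10]),   # low-quality T at position 3
    (2, "GATT", [30, 40, 30, 30]),   # high-quality A outvotes the T
]
print(consensus(layout))  # -> ACGATT
```

The example shows why weighting matters: a single high-quality base can outvote a low-quality disagreement at the same position.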
The layout stage, the core of the OLC strategy, merits further discussion. Practical implementations of OLC assemblers use a hierarchical approach to the layout task. First, segments of the overlap graph are identified that can be unambiguously assembled, representing segments of DNA that are definitely present in the assembled genome—generally DNA segments that do not contain repeats, or that are entirely contained within a single copy of a repeat. For example, Celera Assembler defines the concept of a unitig (uniquely assemblable contig) that can be constructed by following overlap edges until encountering a ‘fork’ in the graph. By fork we mean a read A that overlaps two other reads, B and C, where B and C do not overlap each other. Such a situation often represents the boundary between a repeat and the genomic regions adjacent to the copies of this repeat throughout the genome; however, forks can also be caused by sequencing errors (e.g. the overlap between B and C is obscured by errors in one or both of the reads). Forks caused by errors can be resolved through simple heuristics, e.g. by removing ‘spurs’—single reads that disagree with the bulk of the reads within a region of the assembly graph [8]—or by adding to the graph the overlaps that were ‘hidden’ by sequencing errors [29]. Unitigs can also be constructed with the help of mate-pair information—Arachne seeds the assembly process with groups of mate-pairs that overlap at both ends (paired pairs) [29].
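The fork rule for unitig construction can be sketched on an abstract overlap graph. The graph encoding (a successor map over read identifiers) is hypothetical, and a real implementation must also handle reverse complements, contained reads and the error-induced forks discussed above.

```python
def unitigs(succ):
    """Chain reads into unitigs by following overlap edges until a
    'fork'. succ maps each read to the set of reads overlapping
    its 3' end."""
    pred = {n: set() for n in succ}
    for n, outs in succ.items():
        for m in outs:
            pred[m].add(n)

    def unambiguous(a, b):
        # a -> b is the only edge leaving a and the only edge entering b
        return len(succ[a]) == 1 and len(pred[b]) == 1

    chains, seen = [], set()
    for n in succ:
        if n in seen or any(unambiguous(p, n) for p in pred[n]):
            continue                  # already chained, or mid-chain
        chain, cur = [n], n
        seen.add(n)
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not unambiguous(cur, nxt) or nxt in seen:
                break
            chain.append(nxt)
            seen.add(nxt)
            cur = nxt
        chains.append(chain)
    return chains

graph = {"r1": {"r2"}, "r2": {"r3"}, "r3": {"r4", "r5"},
         "r4": set(), "r5": set()}
print(unitigs(graph))  # -> [['r1', 'r2', 'r3'], ['r4'], ['r5']]
```

In the example, r3 overlaps both r4 and r5 (a fork), so the unitig r1–r2–r3 ends there, just as unitigs end at repeat boundaries.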
Unitig construction is intentionally conservative and rarely results in misassemblies, with the exception of potentially collapsing multiple copies of a repeat into a single unitig. The resulting set of unitigs is, however, highly fragmented, as unitigs cannot cross repeat boundaries. A unitig-only assembly is sufficient for some applications, e.g. for ChIP-Seq experiments [30] or in transcriptome sequencing [31]. In such cases unitigs can be obtained with standalone assemblers such as Minimus [32] or newbler (the assembler provided with the 454 sequencing instrument), or by examining the internal data structures of Celera Assembler or Arachne.
For the assembly of whole genomes, the unitig reconstruction is just a first step in the assembly process and is followed by more sophisticated analyses of the assembly graph aimed at reconstructing larger contiguous segments of the genome being assembled. For example, Celera Assembler and Arachne use mate-pair information to identify sets of unitigs that can be merged into larger structures. Further, algorithms based on network flow analysis have been proposed to identify paths through the assembly graph that accurately account for the copy number of repeats [21], leading to correct reconstructions of larger segments of the genome.
The OLC strategy is arguably the most successful assembly strategy in a practical setting. The majority of the large genomes sequenced in recent years through shotgun sequencing have been assembled with either Celera Assembler or Arachne, and the rapid adoption of the 454 sequencing technology is in part due to the performance of the newbler assembler. The OLC approach has also been successful for the assembly of data with very short read lengths (e.g. Solexa)—the OLC assembler Edena was shown to outperform several greedy assemblers (SSAKE, SHARCGS), as well as an Eulerian path assembler (Velvet) [28]. In part, this success is due to the inherent flexibility of the OLC strategy. The approach is inherently modular and the overlap graph structure lends itself to multiple types of analyses. It is important to note that most OLC assemblers allow users to extract graph information, providing the opportunity for additional downstream analyses (e.g. this information could guide the design of finishing experiments [34]).
The Eulerian path strategy was inspired by early work on sequencing by hybridization (SBH) [35]. The output of an SBH experiment is a list of all oligomers of a given length k present in the genome being sequenced. This information is also referred to as the k-mer spectrum of a genome (A) and is, conceptually, more valuable than the information produced by a shotgun sequencing experiment. In fact, the k-mer spectrum can be viewed as the output of a perfect shotgun experiment in which the reads all have equal length k and perfectly sample the genome, with a read starting at every single base. The SBH technology was ultimately not successful due to technical challenges, in part because it is simply too difficult to experimentally interrogate every single k-mer in a genome except for small values of k (corresponding to very short sequencing reads). However, the theoretical analysis of the problem of assembling a genome from SBH data has resulted in a new assembly strategy. This Eulerian path approach starts by breaking up the set of reads into their k-mer spectrum, effectively reconstructing the information that would be obtained in an SBH experiment. In fact, the data extracted from reads also carry additional information—the abundance (coverage) of each k-mer—information that can be useful in resolving repeats. The resulting k-mer spectrum is then used to construct a deBruijn graph—a graph containing as nodes the (k − 1)-length prefixes and suffixes of the original k-mers, two nodes being linked by an edge if the corresponding (k − 1)-mers have an exact overlap of length k − 2 (B). In effect, each k-mer identified within the set of reads is converted into an edge of the deBruijn graph, and the assembly problem becomes equivalent to finding a path through this graph that uses every edge (we know each k-mer is present in the genome, therefore the corresponding edge must be included in the reconstruction). Such a path is called a Eulerian path, hence the name of this assembly strategy.
Figure 3: (A) k-mer spectrum of a DNA string (bold) for k = 4; (B) Section of the corresponding deBruijn graph. The edges are labeled with the corresponding k-mer; (C) Overlap between two reads (bold) that can be inferred from the corresponding paths through …
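Constructing the deBruijn graph from a set of reads is straightforward to sketch. This minimal version records edge multiplicity (k-mer coverage, the extra information mentioned above) but, unlike real assemblers, performs no error correction on the k-mer spectrum.

```python
from collections import defaultdict

def debruijn(reads, k=4):
    """Build a deBruijn graph: nodes are (k-1)-mers, and every k-mer
    observed in the reads becomes an edge between its (k-1)-length
    prefix and suffix. Edge multiplicity records k-mer coverage."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

for (u, v), n in debruijn(["ACGTACG"], k=4).items():
    print(u, "->", v, "x", n)        # four edges, one per observed 4-mer
```

Note that edge multiplicities much higher than the sequencing coverage are a signature of repeats, which is how the abundance information helps during repeat resolution.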
Intuitively, the Eulerian approach offers several advantages over the OLC strategy. First of all, pairwise overlaps between reads are never explicitly computed, hence no expensive overlap step is necessary; rather, overlaps are implicitly represented in the deBruijn graph by paths that traverse from one read to its neighbor (C). Furthermore, efficient algorithms exist for finding a Eulerian path in a graph, in contrast to the OLC approach, which leads to the difficult task of finding a Hamiltonian path. This intuition, however, ignores the fact that, in general, an exponential number of distinct Eulerian paths can be found in a graph, corresponding to the many different ways a genome can be rearranged around its repeats. Ultimately, the task of an assembler is to find just one of the possible paths—the one corresponding to the correct reconstruction of the genome. Adding constraints to the Eulerian path approach in order to guide the algorithm toward this one correct path leads to a considerably harder computational problem [21]. In fact, simple transformations of the overlap graph from the OLC paradigm lead to a graph structure (string graph) that is functionally equivalent to the deBruijn graph of a genome [33], hinting at the equivalence between these assembly paradigms.
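For contrast with the NP-hard Hamiltonian formulation, a Eulerian path can be found in time linear in the number of edges with Hierholzer's algorithm. A minimal sketch, which assumes the input actually admits a Eulerian path and ignores the exponential multiplicity of alternative paths discussed above:

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: find a path that uses every edge
    exactly once. edges: list of (u, v) pairs; assumes a Eulerian
    path exists."""
    out = defaultdict(list)
    deg = defaultdict(int)            # out-degree minus in-degree
    for u, v in edges:
        out[u].append(v)
        deg[u] += 1
        deg[v] -= 1
    # start at a node with one extra outgoing edge, if any (else a cycle)
    start = next((n for n in out if deg[n] == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        while out[stack[-1]]:
            stack.append(out[stack[-1]].pop())
        path.append(stack.pop())
    return path[::-1]

# the 4-mers of "ACGTACG" as deBruijn edges between 3-mers
edges = [("ACG", "CGT"), ("CGT", "GTA"), ("GTA", "TAC"), ("TAC", "ACG")]
path = eulerian_path(edges)
print(path[0] + "".join(n[-1] for n in path[1:]))  # -> ACGTACG
```

Spelling the node sequence back out, as in the last line, reconstructs the original string, which is precisely the assembly interpretation of the Eulerian path.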
As described above, the construction of the deBruijn graph leads to a loss of information—chopping up the reads into a set of k-mers discards the long-range connectivity information implied by each read. This information is critical in reducing the ambiguity in the graph structure caused by short repeats. In order to incorporate read information in the assembly process, Pevzner et al. formulated a new variant of the Eulerian path problem called the Eulerian superpath problem—the task of finding a Eulerian path through the graph that is constructed from sub-paths corresponding to the individual reads provided as input to the assembler.
The Eulerian strategy has been proposed as an alternative to OLC for the assembly of Sanger data and was implemented in the Euler series of assemblers [36–40]; however, it was not widely adopted, in part due to its sensitivity to sequencing errors. Errors lead to the creation of ‘new’ k-mers and dramatically increase the size and complexity of the resulting deBruijn graph, leading to the need for sophisticated error-correction algorithms [36]. New sequencing technologies, in particular those generating short reads, are a better target for the Eulerian strategy. These technologies generate high depths of coverage in reads that are roughly equal in length, effectively generating the same information that would be produced by an SBH experiment. Two recently developed assemblers specifically targeted at short-read sequence data use the Eulerian strategy: Velvet [41] and Allpaths [42].
None of the assembly strategies described above can completely reconstruct a genome from read data alone, the output of most assemblers consisting of an (often large) collection of independent contigs. Other sources of information can be used to determine the relative placement of these contigs along a genome in a process called scaffolding. The output of this process consists of a series of scaffolds—groups of contigs whose relative placement is known even though the DNA sequence of the genomic regions connecting adjacent contigs is unknown. Most commonly, scaffolding relies on mate-pair information. Two contigs can be inferred to be adjacent in the genome if one end of a mate-pair is assembled within the first contig and the other end is assembled within the second contig. In practice, two or more such links are required between two contigs in order to reduce the impact of experimental errors. Like de novo assembly, scaffolding can be shown to be computationally difficult [43]; however, simple heuristics perform well in practice. All modern assemblers, irrespective of the underlying assembly paradigm, contain a scaffolding module: Euler [37], Arachne [29], Celera Assembler [8], Velvet [41]. Stand-alone scaffolders are also available, such as Bambus [44], allowing mate-pair information to be added to virtually any assembler.
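The mate-pair linking rule, including the two-link threshold mentioned above, can be sketched as follows. The input encoding (a read-to-contig placement map) is a simplification invented for this sketch, and real scaffolders additionally enforce orientation and insert-size constraints.

```python
from collections import Counter

def scaffold_links(mate_pairs, placement, min_links=2):
    """Infer adjacent contig pairs from mate-pairs: a link is counted
    when the two ends of a pair land in different contigs, and kept
    only when supported by >= min_links pairs (to absorb experimental
    errors). placement maps a read id to its contig."""
    links = Counter()
    for r1, r2 in mate_pairs:
        c1, c2 = placement.get(r1), placement.get(r2)
        if c1 and c2 and c1 != c2:
            links[tuple(sorted((c1, c2)))] += 1
    return {pair for pair, n in links.items() if n >= min_links}

placement = {"a1": "ctg1", "a2": "ctg2", "b1": "ctg1", "b2": "ctg2",
             "c1": "ctg2", "c2": "ctg3"}
print(scaffold_links([("a1", "a2"), ("b1", "b2"), ("c1", "c2")], placement))
# -> {('ctg1', 'ctg2')}   (ctg2-ctg3 has only one supporting pair)
```

The `min_links` threshold is exactly the heuristic described in the text: a single chimeric or mis-mapped pair cannot by itself create a false contig adjacency.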
Scaffolding algorithms often follow a greedy approach, starting with the most reliable information and then iteratively incorporating more data as long as the new information does not conflict with the already constructed scaffold. In Bambus [44], the order in which mate-pairs are processed is specified by the user, either by library (e.g. process clone libraries first, followed by fosmids, then BAC-ends) or by the number of links connecting adjacent contigs. In Celera Assembler [8], unique contigs that are well connected to the rest of the assembly (called rocks) are processed first, followed by contigs that are connected by only one mate-pair but also overlap an adjacent contig (stones), followed by unconnected contigs that can be used to tile across gaps in the assembly (pebbles). Velvet [41] starts the scaffolding process with contigs longer than the mate-pair size, using the deBruijn graph information together with mate-pairs to ‘walk’ across the gap between these contigs. Note that in the Eulerian approach, mate-pairs define paths through the graph that need to be traversed in a reconstruction of the genome, thus mate-pair information can be processed with the same algorithms used to solve the Eulerian superpath problem [37].
Scaffolding information can also be obtained from whole-genome mapping data. In particular, optical mapping [45] has been successfully used in this context. In brief, optical mapping can determine the approximate location of restriction enzyme cuts along a genome, thereby generating an ordered list of restriction fragment lengths along the genome. Such information can be used to identify the location of assembled contigs within the genome, leading to a single genome-wide scaffold. The scaffolding process, implemented in the program SOMA [46], involves constructing in silico restriction maps from a set of contigs, then mapping these contigs along a genome-wide optical map. The initial application of this approach to bacterial genomes has resulted in scaffolds that cover up to 80–90% of the entire genome sequence [46].
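The in silico restriction mapping step can be illustrated as follows. This is a sketch only: it cuts at the start of each occurrence of the recognition site, ignoring the enzyme-specific cut offset within the site, and the example sequence and EcoRI-style site are chosen for illustration.

```python
def restriction_map(contig, site):
    """In silico restriction map of a contig: the ordered list of
    fragment lengths produced by cutting at every occurrence of a
    restriction site (cut position simplified to the site start)."""
    cuts, i = [], contig.find(site)
    while i != -1:
        cuts.append(i)
        i = contig.find(site, i + 1)
    bounds = [0] + cuts + [len(contig)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

print(restriction_map("AAGAATTCCCGAATTCTT", "GAATTC"))  # -> [2, 8, 8]
```

Matching such an ordered fragment-length list against the genome-wide optical map is what allows a contig to be anchored at a unique genomic position.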
Mate-pair and optical mapping data provide complementary information to the scaffolding process. Mate-pairs can offer high-resolution but inherently local information while optical maps generate a global, low-resolution view of the genome. These technologies can be effectively combined, for example, by using an optical map to anchor a set of large contigs along the genome, then using mate-pair information to complete the gaps within the scaffold. Note that both mate-pair and mapping-based scaffolding approaches have difficulties scaffolding short contigs and may, therefore, be difficult to apply to fragmented assemblies generated from short-read sequencing data.
Often the genome being sequenced is closely related to a genome that has been previously sequenced. This is the case, for example, when studying genomic variation within a population, e.g. in human resequencing experiments, or when sequencing multiple strains of the same bacterium. The available reference sequence can be used to guide the assembly of a genome in a process called comparative, or templated, assembly. Briefly, the reads are mapped to the reference genome and their placement is used to infer the structure of the genome being sequenced (the target genome). In this process, care must be taken to avoid obscuring differences between the two genomes. For example, the comparative assembler AMOScmp [47] identifies locations in the reference where the alignment of multiple reads ‘breaks’ and flags these regions as the location of possible insertions or deletions between the two genomes. In the context of resequencing experiments using short-read sequence data, the program Maq [48] uses a probabilistic framework to assign a quality value to each read alignment, then uses this information together with sequence quality data in order to characterize single nucleotide polymorphisms (SNPs) between the target genome and the reference. Furthermore, Maq assumes that the target genome is diploid, thus each SNP may represent one of three possible genotypes (aa, ab, bb).
Finally, the comparative approach can also be applied when the reference is a protein sequence. In this case the goal is to reconstruct an individual gene rather than an entire organism. Such an approach is implemented in the assembler ABBA [49]. Here the reads are translated in all six reading frames, then aligned to the sequence of a protein of interest, and the alignment information is used to guide the assembly process. Since protein sequences are conserved across longer evolutionary distances than DNA, protein-guided assembly can be successful even if no closely related reference sequences are available. Further, this approach is effective even for very short reads (25 bp/eight amino acids), since long repeats are unlikely to occur within the span of a single gene.
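The six-frame translation step can be sketched with the standard genetic code. This is a minimal illustration (the example read is invented): it ignores quality values and leftover bases that do not complete a codon, and the codon table is the standard one with '*' marking stop codons.

```python
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# standard genetic code, codons enumerated in TCAG order
CODON = {a + b + c: AA[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(read):
    """Translate a read in all six reading frames (three per strand),
    as done before aligning reads against a protein reference."""
    frames = []
    for seq in (read, revcomp(read)):
        for off in (0, 1, 2):
            codons = [seq[i:i + 3] for i in range(off, len(seq) - 2, 3)]
            frames.append("".join(CODON[c] for c in codons))
    return frames

print(six_frames("ATGGCTAAA"))  # -> ['MAK', 'WL', 'G*', 'FSH', 'LA', '*P']
```

Each translated frame is then a candidate for alignment against the protein of interest; frames containing stop codons mid-alignment are unlikely to match and fall away naturally.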