|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long and inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, the inexpensive short reads technologies produce accurate but fragmented assemblies. Thus, a hybrid approach that assembles long reads (with low coverage) and short reads has a potential to generate high-quality assemblies at reduced cost.
Results: We describe hybridSPAdes algorithm for assembling short and long reads and benchmark it on a variety of bacterial assembly projects. Our results demonstrate that hybridSPAdes generates accurate assemblies (even in projects with relatively low coverage by long reads) thus reducing the overall cost of genome sequencing. We further present the first complete assembly of a genome from single cells using SMRT reads.
Availability and implementation: hybridSPAdes is implemented in C++as a part of SPAdes genome assembler and is publicly available at http://bioinf.spbau.ru/en/spades
Supplementary information: supplementary data are available at Bioinformatics online.
While de novo sequencing from long and inaccurate SMRT reads results in accurate assemblies (Berlin et al., 2015; Chin et al., 2013), these projects require high coverage of a genome by reads and thus remain expensive. Moreover, for Oxford Nanopore technology (with even higher error rates than in SMRT reads), accurate de novo assemblies remain challenging even in high coverage sequencing projects (Goodwin et al., 2015; Loman et al., 2015). For example, the highest reported accuracy of assemblies from Oxford Nanopore reads (99.5%) is significantly below the acceptable standards for finished genomes.
On the other hand, recently developed hybrid approaches for assembling long (and inaccurate) and short (and accurate) reads proved to be useful in generating high-quality assemblies at a relatively low cost (Deshpande et al., 2013; Koren et al., 2012; Ribeiro et al., 2012). In some complex applications (e.g. metagenomics and single cell genomics), the hybrid approaches may represent an attractive alternative to de novo assembly for long reads.
We describe hybridSPAdes—a hybrid assembly approach that benefits from synergy between accurate short and error-prone long reads. hybridSPAdes uses the same algorithms for both Pacific Biosciences reads (about % error rate) and Oxford Nanopore reads (an even higher error rate), e.g. recently, hybridSPAdes was applied to studies of bacterial plankton using assembly of Illumina and Pacific Biosciences reads (Labonté et al., 2015) and for analyzing antibiotics resistance using assembly of Illumina and Oxford Nanopore reads (Ashton et al., 2015).
We benchmark hybridSPAdes against other hybrid assembly tools (Deshpande et al., 2013; Koren et al., 2012) and demonstrate that it enables accurate assemblies even in the case when the number of long reads is relatively small. We further show that hybridSPAdes works well even in the difficult case of single cell genome assembly resulting in the complete circular chromosome assembly of the elusive Candidate Phylum TM6 (McLean et al., 2013) that remains uncultivated.
While the de Bruijn graph approach currently dominates the short reads sequencing projects, its applications to assembling long reads faces various challenges. Indeed, high error rate in long reads makes it difficult to construct the de Bruijn graph from long reads for any reasonable choice of the k-mer size. As a result, the existing de novo long read assemblers use the overlap-layout-consensus approach instead of the de Bruijn graph approach (Berlin et al., 2015; Chin et al., 2013).
Thus, one has to choose between the de Bruijn graph and the overlap-layout-consensus approaches while assembling short and long reads. SPAdes constructs the de Bruijn graph from short reads and transforms it into an assembly graph (Bankevich et al., 2012; Nurk et al., 2013). The assembly graph is defined as the condensed and simplified de Bruijn graph (Pevzner et al., 2001) of k-mers in reads after removal of bulges, tips and chimeric edges. After SPAdes constructs the assembly graph, hybridSPAdes uses long reads for gap closure and repeat resolution in this graph.
exSPAnder (Prjibelski et al., 2014; Vasilinetc et al., 2015) is a module of SPAdes that utilizes various sources of data (e.g. multiple paired-end or mate-pair libraries) for resolving repeats and closing gaps in assembly. exSPAnder is a modular and easily extendable algorithm based on the path extension framework (Boisvert et al., 2010; Bresler et al., 2012; Zhu et al., 2014). Given a path in the assembly graph, exSPAnder iteratively attempts to grow it by choosing one of its extension edges (all the edges that start at the terminal vertex of this path). The choice of the extension edge is controlled by the exSPAnder decision rule (Prjibelski et al., 2014) that evaluates how well this extension edge is supported by data (e.g. paired reads). Thus, in order to incorporate the repeat resolution by long reads in the exSPAnder framework, one has to represent each long read as a read-path, the path in the assembly graph that spells out the error-free version of the long read. hybridSPAdes uses a new decision rule in exSPAnder that is based on the analysis of these read-paths.
In addition to resolving repeats in the assembly graph constructed from short reads, long reads can also contribute to closing the coverage gaps in this graph. In the case when a coverage gap is spanned by multiple long reads, one can fill in the gap by constructing the consensus of long reads within the gap’s span (Chin et al., 2013).
Overall, hybridSPAdes includes the following steps:
Given a set of short reads ShortReads, is the de Bruijn graph constructed on all k-mers from this set. SPAdes uses various graph simplification procedures (Bankevich et al., 2012; Nurk et al., 2013) to transform the de Bruijn graph into the assembly graph . In this section we describe an algorithm for analyzing how each long read Read traverses the graph resulting in a read-path Path(Read).
Similarly to BLASR tool for SMRT reads alignment (Chaisson and Tesler, 2012), hybridSPAdes selects a seed length t (the default value t = 13) and maps t-mers in long reads to edges in the assembly graphs that contain these t-mers. This information is used to find out how each long read traverses . To answer this question, we first find out how a long read traverses edges in the assembly graph.
hybridSPAdes transforms each long read into a set of t-mers and finds positions of these t-mers on the edges of the assembly graph. Note that t-mers starting at the first positions or ending at the last positions of an edge map to a vertex in the assembly graph. Thus, since such t-mers may be assigned to multiple edges incident to these vertices, we exclude them from further consideration.
Given a t-mer shared by a read and an edge in the assembly graph, we define its t-mer mapping as a triple (e, i, j) where e is an edge in the assembly graph where the t-mer is mapped, and i and j are the positions of the t-mer on this edge and in the read, respectively.
Since there are many spurious t-mer mappings, the fact that a t-mer in a read Read maps to an edge in the assembly graph does not necessarily mean that the read-path Path(Read) traverses this edge. However, our analysis revealed that for nearly all reads, if more than MinSeedNumber t-mers in a read map to an edge in the assembly graph then the read-path traverses this edge (the default value MinSeedNumber=8). We therefore say that an edge in the assembly graph is supported by a read if at least MinSeedNumber t-mers in this read map to this edge.
Consider mappings and of two t-mers from a given read. Define and dgraph as the distances between these t-mers in the read and in the assembly graph, respectively. dgraph is defined as follows: if the mappings share the same edge and i1<i2, the distance is , otherwise it is the length of the shortest path in the assembly graph from the position i1 on edge e1 to position i2 on edge e2.
A mapping is a predecessor of mapping if
The default values of parameters c1, c2 and c3 are 0.7, 1.3 and 500, respectively.
We further construct a directed graph using the set of all t-mer mappings from the read Read as the vertex-set. We connect vertices (t-mer mappings) in this graph by a directed edge if the first one is a predecessor of the second. Since the resulting graph is acyclic (every edge connects a mapping with a smaller read coordinate to a vertex with a larger read coordinate), we can find a longest path in this graph using a fast dynamic programming algorithm.
Next, we determine how the found path through t-mer mappings in the graph traverses long edges of the assembly graph. Since there are many spurious t-mer mappings, we limit attention to the sequence of edges EdgeSequence(Read) in this path that are supported by the given read Read. Note that EdgeSequence(Read) may have some missing edges as compared to the correct read-path Path(Read). Our goal now is to reconstruct these missing edges that often aggregate into complex subgraphs in the assembly graph.
Consider two consecutive edges in the sequence EdgeSequence(Read) that are not consecutive in the assembly graph. Our goal is to figure out how the correct read-path Path(Read) traverses the assembly graph between these edges.
Figure 1 depicts two consecutive edges from EdgeSequence(Read) (shown in red) that are separated by a complex subgraph in the assembly graph. We need to determine which of the alternative paths between these edges in the assembly graph in Figure 1 are traversed by Path(Read).
Given a path Path in a directed graph with edges labeled by characters from a given alphabet, we define String(Path) as the concatenation of labels of the edges from Path. We define the edit distance between strings String1 and String2 as the minimum total cost of substitutions and indels needed to transform one string into another (we assume that every substitution has cost μ and every indel has cost σ). Our goal is to solve the following problem:
A brute-force solution of this problem (in the context of hybrid assembly) is to enumerate all possible paths between two long edges (within a certain range of lengths) and to find a path with the minimum edit distance to the long read. While this approach works for bacterial genomes and is used in the current hybridSPAdes implementation, the number of paths may be exponential in the number of vertices of the assembly graph. Below we describe a polynomial algorithm for solving the Graph Alignment Problem.
Given a labeled directed graph Graph and a string String, we define a graph Graph(String) with the vertex-set corresponding to the pairs where v is a vertex in Graph and . In order to define the edge-set of Graph(String), we specify the incoming edges to the vertex as follows:
It is easy to see that each series of edit operations with total cost score between String and a string spelled by a path from source to sink in Graph corresponds to a path of length score between and in Graph(String).
Therefore, in order to solve the Graph Alignment Problem, we need to find a shortest path between and in Graph(String). Since this graph may have directed cycles, we use the Dijkstra algorithm (Cormen et al., 2001) with the worst case running time , where and are the vertex-set and edge-sets of the graph Graph(String), respectively.
In the case of the hybrid assembly, since there are at most 4 outgoing edges for each vertex in the assembly graph, there are at most outgoing edges for each vertex in the graph Graph(String). Thus, since the total number of edges in Graph(String) is , the running time of the algorithm is , where V is the vertex-set of the assembly graph.
In the context of a typical assembly graph, is much larger than . Also, for the majority of reads, there exists a path of length approximately between source and sink in the assembly graph. Therefore, we can ignore all the vertices of that are farther than from both source and sink while searching for a path with the minimum edit distance.
Although the coverage by short reads is rather uniform in most assembly projects, there are sometimes significant drops in k-mer coverage (Bankevich et al., 2012) and even regions where the k-mer coverage drops to zero. However, these drops in the k-mer coverage rarely affect repetitive edges in the assembly graph since it is unlikely that they occur in all copies of a repeat. Below we focus on gaps in the k-mer coverage that occur within a unique (non-repetitive) region of a genome corresponding a single edge in the assembly graph.
A coverage gap breaks this edge into two edges that we refer to as a sink edge (ending in a vertex without outgoing edges) and a source edge (starting in a vertex without incoming edges). If a long read maps to both a sink and a source edge, then this read can potentially close a gap in the assembly graph. However, a single error-prone long read spanning the gap does not allow one to accurately close a gap, i.e. to reconstruct the nucleotide sequence of the gap. We thus collect the set of all long reads spanning the same pair of sink and source edges (forming the set of reads SpanningReads) and close the coverage gap using the consensus sequence of all these reads.
For each read from SpanningReads, we align this read against the sink edge (ending at position p of the read) and the source edge (starting at position q of the read). The segment of the read from position p+1 to q – 1 represents an error-prone sequence of the gap. hybridSPAdes fills in the gap by solving the Multiple String Consensus Problem (Sim and Park, 2003) for all such segments derived from SpanningReads. To solve this problem, we apply the Partial Order Graph approach (Lee et al., 2002) and use its ConsensusCore library implementation from Pacific Biosciences that proved to work well for SMRT reads (Chin et al., 2013).
The read-against-graph alignment algorithm described above allows one to map each long read to a read-path in the assembly graph. During the repeat resolution stage of hybridSPAdes, we limit attention to paths traversing at least two long edges in the assembly graph. Our goal is to transform this set of paths into contigs that represent the genome assembly. Below we explain how to achieve this goal using the exSPAnder repeat resolution framework (Prjibelski et al., 2014; Vasilinetc et al., 2015). exSPAnder iteratively constructs a set of paths Paths that represent contiguous segments of the genome. In the beginning, Paths is formed by paths consisting of single long edges in the assembly graph. exSPAnder attempts to iteratively extend each path in Paths using its decision rule (see Section 2). If multiple extension edges pass the decision rule for a given path (which usually implies that this path ends in a difficult-to-resolve repeat), exSPAnder stops the extension process for this path.
Given a path P and its extension edge e, exSPAnder defines the scoring function and bases its decision rule on analyzing all values for all extension edges. Below we describe how hybridSPAdes defines .
Read-paths and overlap if a suffix of (i.e. the path formed by the last i edges of ) coincides with a prefix of (i.e. the path formed by the first i edges of ). We define as the longest suffix of that coincides with a prefix of .
A read-path is called trivial if it consists of a single edge and non-trivial otherwise. Since trivial read-paths do not contribute to the repeat resolution, we exclude them from further consideration. Note that there are typically multiple reads with the same read-path, at least in projects with high coverage by SMRT reads. We define multiplicity of a read-path as the number of long reads resulting in this read-path and classify a read-path as reliable if its multiplicity exceeds 1 (SMRT datasets have many chimeric reads that typically have multiplicity 1).
Let ReadPaths be the set of all non-trivial reliable read-paths and ReadPaths(e) be its subset formed by all read-paths containing an edge e. An edge e in is called non-repetitive if
For the datasets with relatively even coverage by Illumina reads (e.g. reads generated from cultured cells but not single cells amplified with MDA) we impose an additional condition—an edge is called non-repetitive if it is sufficiently long (exceeds 500bp in the default setting) and its coverage does not significantly exceeds the median coverage of the entire dataset (does not exceed the median coverage by more than 20% in the default setting).
A read-path ReadPath follows a path P in the assembly graph if there exists a path such that its prefix coincides with P, its suffix coincides with ReadPath, and at least one of edges from is non-repetitive.
Given a path P and a set of read-paths ReadPaths, we define ReadPathsP as the set of all read-paths from ReadPaths that follow P. Given an extension edge e of a path P, we define as the total multiplicity of read-paths in the set , where is the path P extended by the edge e.
If a path P has an extension edge e whose score dominates scores of all other extension edges (i.e. exceeds them by a factor of at least c), hybridSPAdes extends P by e (the default value c=2). Otherwise, the extension procedure stops. If the highest scoring extension edge does not dominate the scores of all other extension edges, exSPAnder applies the standard extension rules based on read-pairs (Prjibelski et al., 2014; Vasilinetc et al., 2015).
We analyzed datasets combining short and long reads from E.coli str. K12 (datasets ECOLI100, ECOLI200 and ECOLI-NANO), M.ruber (dataset MRUBER), Streptomyces sp. PAMC26508 (dataset STREPTO) and candidate division TM6 bacterium TM6SC1 (dataset TM6). The reads in the latter dataset were generated from single cells amplified with the Multiple Displacement Amplification (MDA) technology (Lasken, 2007). Prior to this study, the genome of TM6SC1 was only partially assembled (see McLean et al., 2013 for details).
ECOLI200 dataset contains SMRT reads with 200×coverage and P6/C4 enzyme/chemistry (average read length 5280bp). ECOLI100 dataset contains SMRT reads with 100×coverage and older P4/C2 enzyme/chemistry (average read length 10598bp). ECOLI-NANO dataset contains Oxford Nanopore reads (average read length 6060bp). All three E.coli str. K12 datasets contain Illumina reads of length 100bp, mean insert length 215bp and coverage 230×obtained with Illumina Genome Analyzer IIx.
Mapping of Illumina reads to E.coli str. K12 genome revealed that the strain used for generating these datasets differs from the reference sequence of E.coli str. K12 (three insertions of mobile elements about 1kbp in length). These differences result in six breakpoints that are reported as six assembly errors by the assembly evaluation tool QUAST (Gurevich et al., 2013). We thus ignored these six (pseudo) errors while benchmarking various assemblers.
MRUBER dataset contains SMRT reads with 120×coverage (average read length 2430bp). Illumina reads for this dataset were generated using Illumina Nextera Mate Pair technology (there were no paired-end reads in this dataset) with read length 150bp, mean insert length 3500bp and low 20×coverage.
STREPTO dataset contains SMRT reads with 25×coverage (average read length 1410bp). Illumina reads were generated with Illumina HiSeq 2000 with read length 150bp, mean insert length 280bp and coverage ×. Streptomyces sp. PAMC26508 genome has high (71%) GC content.
TM6 dataset contains both SMRT reads (45×coverage) and Illumina reads (265×coverage) generated from MDA-amplified single cells. The Illumina reads were generated with Genome Analyzer IIx (read length 100bp, mean insert length 270bp). Note that MDA-based single cell approaches result in highly uneven genome coverage by reads (Bankevich et al., 2012).
The links to all the datasets and reference genomes are available in supplementary materials.
We benchmarked hybridSPAdes (as a part of SPAdes 3.6 release), Cerulean (Deshpande et al., 2013) and PBcR (Koren et al., 2012) (version wgs-8.3rc1). In the hybrid mode (that we refer to as hybridPBcR), PBcR uses short Illumina reads to error-correct the long (SMRT or Nanopore) reads. In the self-correction mode (that we refer to as selfPBcR) PBcR only uses long reads for assembly. We used the QUAST assembly evaluation tool (Gurevich et al., 2013) for benchmarking. While QUAST reports many assembly metrics, the benchmarking tables below are limited to NG50, NG75, LG50, the length of the longest contig, and the number of misassemblies (MA), where NG50 is the length for which the collection of all contigs of that length or longer covers at least half of the reference genome, NG75 is defined similarly to NG50 with 75% of reference genome instead of 50%. LG50 is the number of contigs longer or equal than NG50.
Although AllPaths-LG (Ribeiro et al., 2012) has a hybrid mode for assembling short and long reads, we did not have an opportunity to benchmark it since none of the datasets described above satisfy the strict constraints on the insert sizes imposed by AllPaths-LG.
The field of hybrid assembly has been rapidly developing in the last year when the Oxford Nanopore assembly pipeline Nanocorrect (Loman et al., 2015), hybrid Nanopore & Illumina assembly pipeline NanoCorr (Goodwin et al., 2015) and hybrid scaffolder LINKS (Warren et al., 2015) were added to the arsenal of tools for assembling Oxford Nanopore reads. However, Nanocorrect and NanoCorr focused on Oxford Nanopore reads rather than Pacific Biosciences reads. We and others (Ashton et al., 2015; Liao et al., 2015; Utturkar et al., 2014) demonstrated that hybridSPAdes works well for hybrid assembly with both Pacific Biosciences and Oxford Nanopore reads.
hybridSPAdes and selfPBcR assembled both ECOLI100 and ECOLI200 datasets in a single contig (Table 1). As expected, both hybridSPAdes and selfPBcR resulted in six (pseudo) assembly errors caused by the known differences between the analyzed and the reference strains (three insertions of mobile elements). selfPBcR produced two additional (real) misassemblies and hybridSPAdes produced one. Cerulean and hybridPBcR generated more fragmented assembly and, in case of Cerulean, more misassemblies for ECOLI100 dataset. For ECOLI200 dataset, both Cerulean and hybridPBcR generated inferior assemblies.
In addition to hybrid assembly of Illumina and SMRT reads, hybridSPAdes also assembled ECOLI-NANO dataset into a single contig. All other tested assemblers failed to assemble this dataset.
We have also investigated how the performance of hybridSPAdes and PBcR deteriorates when the coverage by long reads is reduced. To perform this analysis, we retained a fixed fraction of randomly chosen SMRT reads resulting in coverage varying from 200× to 6.25×. As Table 2 illustrates, even with low 12.5×coverage by SMRT reads, hybridSPAdes generates a high-quality assembly (better than PBcR with 50×coverage). The quality of PBcR assemblies deteriorates when the coverage falls below 50×.
selfPBcR assembled MRUBER dataset into a single contig with a single misassembly, while hybridSPAdes assembled this dataset into three error-free contigs with zero misassemblies (Table 3). hybridSPAdes failed to assemble this dataset into a single contig because long reads in this dataset do not span over a long 7Kbp repeat. hybridPBCR produced an assembly with quite similar stats. Cerulean produced lower quality assembly, hybridSPAdes generated a high-quality assembly of STREPTO dataset with Kbp (2 misassemblies), while Cerulean generated an assembly with Kbp and 10 misassemblies (Table 4). hybridPBcR failed on this dataset while selfPBcR produced a low-quality assembly due to the low coverage by SMRT reads.
In contrast to the previous assemblies of SMRT reads in single cell genomics (Labonté et al., 2015; Swan et al., 2014) that came short of closing the assemblies, application of hybridSPAdes to TM6 dataset resulted in a single circular contig of length 1089Kbp (which contains all previously sequenced seven long contigs with total length 1075Kbp (McLean et al., 2013)). To the best of our knowledge, it is the first assembly of SMRT reads in single cell genomics that resulted in a complete genome.
Since prior to this study, TM6 genome was incomplete, we used the genome assembled by hybridSPAdes to evaluate performance of other assemblers on this dataset.
Cerulean generated an assembly with the largest contig of length 774Kbp and 1 misassembly (Table 5). hybridPBCR failed on this dataset while selfPBcR produced a low-quality assembly.
Our benchmarking demonstrated that hybridSPAdes improves on the state-of-the-art hybrid assemblers on all datasets we have analyzed (on two of these datasets with a high SMRT read coverage, selfPBcR showed similar results).
Early tools for hybrid assembly combined Illumina and Sanger reads or Illumina and 454 reads (Boisvert et al., 2010; Chevreux et al., 1999; Zimin et al., 2013). However, hybrid assembly of Illumina and SMRT reads presents new algorithmic challenges since SMRT reads have higher error rates than Sanger reads or 454 reads.
Our benchmarking demonstrated that hybridSPAdes assembles short accurate and long error-prone reads into long and accurate contigs. The resulting low-cost high-quality assemblies are important for accurate genome annotations and comparative genomics. Moreover, hybridSPAdes opens a possibility to complete genomes assembled from single cells. Although 1000s of bacterial genomes have been assembled from single cells in the last 3 years using specialized single cell assemblers SPAdes (Bankevich et al., 2012) and IDBA-UD (Peng et al., 2012), finishing genomes amplified from single cells is often viewed as an impossible task (Lasken and McLean, 2014). Moreover, sequencing single cell genomes from SMRT reads is likely to be excessively expensive due to highly non-uniform coverage characteristic of the MDA-amplified datasets. Hybrid assembly of short and long reads, on the other hand, turns complete genome assembly from single cells into reality.
While the detailed analysis of the relative market costs and trade-offs of various sequencing technologies remained beyond the scope of this article, we anticipate that many future sequencing projects will use hybrid assembly of reads generated by various technologies.
We thank Roger Lasken, Mark Novotny and Cheryl Heiner for their expertise that helped to generate the SMRT reads for the MDA amplified TM6 genome. We are grateful to Alla Lapidus, Kira Vyatkina and SPAdes developing team for many thoughtful discussions, suggestions and comments that improved the article. We are also grateful to the anonymous referees whose comments have benefited the article greatly.
This work was supported by the Russian Science Foundation [grant 14-50-00069].
Conflict of Interest: none declared.