We analyzed the performance of AGORA on 369 sequenced bacterial genomes, using error-free de Bruijn graphs generated from the complete genome sequences as previously described in [34
] and optical maps simulated from the sequences. In addition, we also tested AGORA on a published optical map of the Y. pestis
KIM genome [13
]. Note that the error-free de Bruijn graph of a genome sequence of order k is identical to the de Bruijn graph constructed from a collection of error-free sequence reads where every k-mer in the genome is covered by at least one read. The de Bruijn graph of each sequence was simplified by replacing unipaths (a path in which all of the nodes have in-degree
]) with a single edge representing the longer sequence, along with other de Bruijn graph simplifications, which preserve all the information relevant for genome reconstruction from the original de Bruijn graph (see [34
] for further details on the simplification procedures). Moreover, we collapsed parallel edges with greater than 99% sequence similarity, as long as the difference in the sequences did not create or remove any restriction sites (see Methods for more details).
To simulate optical maps from a genome sequence, we first compute an in silico
map of the sequence and then perturb the fragments within this map by sampling from an error distribution. We modeled three different error levels -- highmedium
--- and simulated one optical map from each of these distributions to measure the effect of optical mapping error on assembly quality. Although the error simulation is a simple process which may not capture the full characteristics of experimentally generated optical maps (see Methods for details), the results nonetheless show the impact of noise on the final assembly quality. The high error setting has characteristics matching the maximum fragment sizing error and maximum size of small fragments lost observed in the experimental Y. pestis
KIM optical map, while the low error setting corresponds to what might be achievable with the new nano-coding technology [40
]. The low error setting can noticeably improve the performance of our algorithm since it does not remove small fragments from the optical map. A more precise description of the three levels of noise used in our simulations can be found in the Methods section.
After running AGORA, we used four different metrics to measure the quality of the final path through the de Bruijn graph and the sequence associated with it. (See Methods for detailed descriptions of these measures.) Our first metric, sequence correctness
, roughly corresponds to the percentage of the final sequence that was assembled in the correct order. Our second metric, edge correctness
, is the number of graph edges placed in the correct order divided by the total number of edges in the de Bruijn graph. The last two metrics measuring the final N50 size
and number of contigs
produced by our algorithm are computed after breaking the reconstructed sequence (from the path found by AGORA) wherever an error occurs, and treating sequence segments between errors as independent contigs produced by our assembler. This approach is consistent with the one used by Salzberg et al. [41
] in the context of assembly evaluation. These final contig statistics are then compared against the original N50 size and number of contigs that would arise if one were to treat each edge in the starting de Bruijn graph as a separate contig.
Assembly of bacterial genomes with simulated optical maps
We start by measuring the performance of AGORA on assembling 369 bacterial genomes, providing as input their simplified de Bruijn graphs of order k
100, which are equivalent to the graphs that can be obtained in an error-free sequencing experiment generating reads longer than 100
bp, covering each k-mer of length 100 in the genome. For each genome, we computationally generated BamHI (recognition sequence G^GATCC) in silico
maps for simulating optical maps. Statistics summarizing the number of restriction sites in the in silico
map of each genome and the characteristics of the de Bruijn graphs of the genomes are shown in Table .
Statistics of the de Bruijn graphs and optical maps used in our simulations
Table provides some indication of the complexity of the genomes in our test data set, as measured by their corresponding de Bruijn graphs. The number of nodes in each de Bruijn graph roughly represents the number of distinct repeat sequences longer than 100
bp occurring in the genome, while the number of edges roughly represents the number of times those repeated sequences occur in the genome. Genomes with more nodes and edges in the de Bruijn graph are generally more difficult to assemble, since they contain more repeat sequences.
To determine how well we could assemble the 369 bacterial genomes with the help of optical maps, we ran our algorithm on each de Bruijn graph and optical maps simulated with the three different error settings described above. We then measured the quality of our assemblies based on sequence correctness, edge correctness, N50 size, and number of contigs. The results are aggregated in Figure , and per-genome information is provided in Additional file 1
Figure 1 Assessment of the quality of bacterial genome assemblies. Measurements of the quality of assemblies produced by our algorithm on 369 bacterial genomes under three different optical map error rates. In each boxplot, we extend the whiskers beyond the upper (more ...)
As we can see in Figure a, AGORA assembles over ¾ of all genomes with greater than 98% sequence correctness for all three error settings. The mean sequence correctness in the high, medium, and low error settings were 89.2%, 91.9%, and 95.9%, respectively. The means were lower than the median sequence correctness values due to a few very complex genomes for which the algorithm could only assemble a small fraction of the genome, producing outliers which are not shown in the Figure . When measuring the number of edges assembled in the correct order as shown in Figure b, the edge correctness percentages are lower than the sequence correctness percentages, primarily because edges with short sequences (typically under 1 kbp in length) may be misplaced by our algorithm due to a lack of restriction sites. Although AGORA may misplace 10%-20% of the de Bruijn graph edges, these edges typically contribute to less than 2% of the genome assembled, as indicated by the sequence correctness boxplot shown in Figure a.
In Figure c and d, we plot statistics on the final N50 size and number of contigs that would result if we were to break the final path produced by AGORA wherever a mistake is made, and compare these values with the initial quality of the assembly before using mapping data. We can see in the figures that our algorithm substantially improves the N50 size and number of contigs, even after errors are accounted for. When measuring the overall improvement in N50 size we found that, in the median case, the N50 size increased by a factor of between 3.61 and 4.09, while the mean improvement was between 5.44 and 5.77, depending on the level of mapping error simulated. Similarly, the number of contigs decreased by a factor of between 3.48 and 5.15 in the median case, and the mean improvement was between 6.67 and 10.74. In addition, we found that we had assembled 43, 52, and 69 genomes perfectly into a single contig representing the entire genome sequence in the high, medium, and low settings, respectively.
AGORA finished in under one minute for ¾ of instances, while the longest runtime was around 20 minutes. Note that we forced the algorithm to skip to the next landmark if no path could be found within one minute (see Methods for more details), since our tests required running the algorithm more than 1,000 times. We do not expect this time limitation to significantly affect the median and quartile statistics, as only 18.1% percent of genomes had any regions skipped. Those genomes were generally complex genomes with assembly quality in the lowest quartile, and additional running time did not improve their assembly quality significantly.
In addition to the aggregate statistics shown in Figure , we also individually compared the original and final N50 sizes produced by our algorithm for each genome using the medium error setting. In Figure , we plot a point for each genome at its original and final N50 size, after normalizing by genome length. We see that most genomes have substantial improvement in N50 size, with some genomes even having normalized N50 size starting below 20% and improving to nearly 100% after assembly. Some genomes with normalized N50 size starting below 20% did not improve much, mostly because the corresponding graphs contained many short edges without any restriction sites, making it difficult to place those edges on the optical map and rule out incorrect paths in the de Bruijn graph.
Figure 2 Improvement in normalized N50 size after assembly. For each of our 369 bacterial genomes, we plot the initial normalized N50 size (x axis) relative to the normalized N50 size after assembly (y axis) when provided a simulated optical map with the medium (more ...)
As the normalized N50 size did not seem to predict very well how accurately we could assemble the final genome, we performed further statistical analyses to test whether other factors were more correlated with the final assembly quality. We computed Spearman’s rank correlation coefficient between sequence correctness and various de Bruijn graph characteristics. We found that the sequence correctness obtained by AGORA had the highest correlation with the average edge length of the de Bruijn graph among all the characteristics we measured. The correlations are: genome size (−0.04), normalized N50 size (0.61), N50 size (0.69), number of edges (−0.75), average number of restriction sites per edge (0.76), and average edge length (0.83).
It is not surprising that average edge length, average number of restriction sites per edge, and N50 size have a very strong correlation with sequence correctness as this implies the corresponding de Bruijn graph has long edges which are likely to contain multiple restriction sites. These long edges are easier to place unambiguously along the map and can be used to rule out incorrect de Bruijn graph paths very effectively.
It is important to note that the average edge length and number of edges are strongly anti-correlated (−0.90 Spearman’s coefficient) due to the fact that the genome lengths in our dataset are within a fairly narrow range of 1–5 Mbp (mega base pairs). Given our data, we cannot fully distinguish between the impact of long edges versus fewer edges (lower complexity) on our ability to reconstruct a genome. Genome length also has very low correlation with the sequence correctness of the assembly, but more testing needs to be done on larger and more complex genomes in order to better determine the factors that most influence the quality of genome assembly.
We also directly plotted sequence correctness versus the average edge length in each de Bruijn graph over all three error rates (Figure ) and observed that sequence correctness generally increases with average edge length. Almost all genomes with average edge length greater than 10 kbp can be assembled with accuracy over 98%, even when relying on maps with the highest error rate, while we have mixed results for genomes with shorter average edge length. Also, note the impact of error level in the optical map on the assembly accuracy is less than 2% for genomes with average edge length greater than 10 kbp (which includes 79.9% of our genomes), but has a greater impact on graphs with shorter edges, where the final sequence correctness differs by as much as 40 percentage points between the high and low error settings. Additionally, we find that most genomes with a starting N50 size larger than approximately 50 kbp also yield map-guided assemblies with greater than 98% accuracy. (See Additional file 2
Figure 3 Impact of average edge length of de Bruijn graph on sequence correctness of assembly. A plot showing the sequence correctness of 369 bacterial genome assemblies versus the average edge length of their starting de Bruijn graphs, under three different optical (more ...)
Assembly of Y. pestis KIM with previously published optical map
We also evaluated the performance of AGORA on the assembly of the genome of Y. pestis
KIM (NCBI accession NC_004088) with a PvuII (recognition sequence CAG^CTG) optical map experimentally determined in [13
]. In addition to the experimental optical map, we also ran AGORA on optical maps simulated for PvuII with the same three levels of noise mentioned previously. Since Y. pestis
is a complex genome containing many repeats (primarily IS elements), we provided AGORA with a de Bruijn graph produced with a larger k-mer size of 500 in the initial experiment. Subsequent experiments for k-mer size 100 yield lower quality assemblies, as described in the next section. AGORA took under 5 minutes to find a path matching the genome optical map for each error rate (without having to skip any regions between landmarks due to the 1
minute timeout). The resulting reconstruction of the genome matched the correct sequence with accuracy between 86.74% and 99.13%, depending on the error rate used (see Table ).
Statistics on the assembly ofY. PestisKIM with optical maps of different error rates
Optical mapping information substantially improves the initial N50 size of 62,865
bp (computed from the de Bruijn graph of this genome with k-mer size 500) by a factor of between 6.4 and 18.9 depending on the quality of the optical map. The number of contigs is correspondingly reduced by a factor of between 2.45 and 16.6. While the maximum fragment sizing error and maximum size of small fragments lost in the high error optical map simulation match the values observed the experimentally produced optical map [13
] (10% and 2 kbp sizing error and loss of fragments smaller than 2 kbp), AGORA generates a slightly worse assembly when guided by the experimental map. This indicates that the simple heuristic procedure we used to simulate noise may not adequately match the precise characteristics of the noise seen in experimentally determined optical maps (see Methods for more details). Nonetheless, our results still show the potential impact of noise in the optical on the quality of the final assembly.
To further assess the quality of the assemblies, we used Mummer [42
] to compare the sequences produced by our algorithm to the known genome sequence. Figure a illustrates our previous analysis showing that the sequence generated by AGORA matches the original sequence with greater than 99% accuracy, when given an optical map with low noise. The line along the diagonal indicates sequence that correctly matches the true genome, while only 7 errors can be seen at the locations marked by small circles, which occur due to short misplaced edges. In Figure b, we see that there are more errors in the assembly built using the experimental optical map. The gaps in the line along the diagonal in Figure b indicate roughly 13% of the genome is not correctly assembled by our algorithm. The longest regions of incorrectly assembled sequence occur in portions of the genome where there are few restriction sites.
Figure 4 Mummerplot comparison of assemblies produced with low error and experimental optical map. Two dot plots generated by Mummerplot comparing the known genome sequence of Y. pestis KIM to the sequence assembly produced by our algorithm, when given an optical (more ...)
In these regions, AGORA picks a single path among several possible paths which may match the optical map, possibly leading to errors in the reconstruction. For example, the largest erroneous gap shown in the lower left of Figure b occurs within a 110 kbp genomic region that contains only two restriction fragments of size 40 kbp and 70 kbp, respectively. Within the same region, genomic repeats lead to a fragmentation of the de Bruijn graph resulting in a collection of short edges without any restriction site information, and one edge which contains a single restriction site. The difference in performance on the low error optical map and the experimental optical map highlights the potential benefit of developing higher resolution and more accurate mapping technologies (such as nano-coding [40
]). Alternatively, additional mate-pair information (providing short-range information) along with an optical may also help resolve ambiguities in regions with few restriction sites.
Effect of restriction enzyme choice on assembly quality
We further examined the effect of using different restriction enzymes on the quality of the assembly that can be produced by our algorithm for the Y. pestis KIM genome. We generated a de Bruijn graph with k-mer size 100 for the sequence of Y. pestis KIM, and then computed the in silico map of the genome for each of 102 different restriction enzymes. We then simulated noisy optical maps by adding three different levels of noise to each in silico map, as described previously. In Figure , we plot the number of restriction sites versus the sequence correctness achieved by our algorithm using the optical map for each enzyme at the three different optical map error rates.
Figure 5 Impact of restriction enzyme choice on assembly quality. The choice of restriction enzymes can impact the correctness of the assembly. Each point represents the sequence correctness of an assembly of Y. pestis KIM when given a de Bruijn graph of k-mer (more ...)
The vertical line in Figure corresponds to the number of restriction sites for the enzyme PvuII used to construct the experimental optical map of this genome [13
]. The circles drawn on the line represent the quality of the corresponding assemblies with a PvuII map under different mapping error rates. Although we are able to assemble the genome with 90.3% sequence correctness in the low error setting, the medium and high error settings only assemble with 68.3% and 58.6% sequence correctness, respectively (using the experimental map only yields 48.6% accuracy). Figure illustrates that restriction enzymes that cut more frequently can yield better assemblies. A HindIII (recognition sequence A^ACGTT) optical map with 1,566 restriction sites achieves 99.8% sequence correctness in the low error setting and 98.4% in the medium error setting, as indicated by the green blue squares in Figure , respectively. In the high error setting, we can achieve 66.3% sequence correctness (shown as the red square) with a BSrGI (recognition sequence T^GTACA) optical map with 573 restriction sites. Over the three cases, we can improve the accuracy by between 7.7% and 30.1% by choosing an appropriate restriction enzyme.
Figure also shows the dependence between the frequency with which an enzyme cuts and the quality of the resulting assembly. In the low error setting, assembly accuracy generally increases with the density of restriction sites on the optical restriction map, although this is not true for the medium and high error rates where the performance of the algorithm starts decreasing beyond a certain cut frequency. This phenomenon can be explained by the loss of more small fragments as cut frequency increases, and the increased difficulty of finding landmark edges when there are many smaller fragments of roughly the same size. In the high error setting, we note that restriction enzymes with around 500 recognition sites yield assemblies with the highest sequence correctness for Y. pestis KIM.
The strong dependence of the quality of assembly on the restriction enzyme used highlights the need for choosing an appropriate enzyme. Running preliminary lab experiments to digest the genome with different enzymes can be used to find an enzyme which cuts the genome at an appropriate frequency (in the case of Y. pestis, the ideal restriction enzyme yields an average fragment size of roughly 10 kbp). Alternatively, generating preliminary sequence data and building a corresponding de Bruijn graph, can also help estimate the cut frequency of various restriction enzymes.
Optical maps versus mate-pairs
The use of mate-pairs to guide the assembly process was previously studied by Wetzel et al. [4
] using the same genomes used in our study. A direct comparison to the full results presented previously is difficult to perform as our goal here is the reconstruction of a single contig spanning an entire chromosome, while the work of Wetzel et al. is focused on the resolution of individual repeats (and the corresponding reduction in the complexity of the assembly graph) using mate-pair information. Furthermore, mate-pairs and optical maps provide complementary types of information: mate-pairs provide local information and are most effective in the short range (as shown, e.g., in [4
]) where the optical mapping resolution may be limited, while optical maps provide global information and are particularly effective in the long range (10s-100s of kbp, ranges for which mate-pair libraries are difficult to generate). To demonstrate the complementary strengths of these technologies, we highlight a couple genomes analyzed both with mate-pairs in [4
] and with optical maps in our study.
First, Rhodospirillum rubrum
ATCC 11170 (NCBI accession NC_007643) was completely and correctly resolved by AGORA in our study, but mate-pair based analyses were unable to fully resolve this genome even when trying different combinations of library sizes. We applied the mate-pair repeat resolution approach described in the work of Wetzel et al. [4
] using both the tuned library mixture of sizes 477 and 6047 (see [4
] for details on how the library sizes were chosen), and the ‘standard’ combinations of 2kbp
8kbp, or 2kbp
35kbp. Note that it is possible that some combination of two or more mate-pair libraries could have resolved this genome, as we have not exhaustively explored all possible combinations of mate-pair libraries. However, in practical terms, it is unlikely that a lab interested in solving the Rhodospirillum
genome would attempt multiple library preparations in hopes of finding the perfect combination for this genome.
A second example is the genome of Streptococcus agalactiae
NEM316 (NCBI accession NC_004368) which contains a 47 kbp-long plasmid-like repeat (pNEM316-1) occurring three times within the main chromosome [43
]. Resolving this repeat would require mate-pairs longer than 47 kbp, which are beyond the sizes routinely generated, especially in the context of next generation sequencing technologies (fosmid libraries only extend to ~40 kbp).
Real assembly graphs
Our results have focused on running our proof-of-principle algorithm on ideal de Bruijn graphs obtained from error-free sequencing data. The application of AGORA to data from real sequencing experiments is the object of future work and beyond the scope of this paper. However, it is natural to ask whether our algorithms can feasibly be extended to real datasets. To address this question we focused on sequencing data available for the Yersinia pestis KIM genome, specifically a 454 dataset (SRA accession SRX012379). We assembled these reads using Newbler [Roche] and explored the structure of the resulting contig graph (available from the 454ContigGraph.txt file produced by Newbler).
We compared the Newbler graph to the ideal de Bruijn graphs of order 100 and 500, as the average length of the 454 reads falls between these values at 438
bp. The Newbler assembly resulted in 283 contigs with an N50 size of 38,282 while the order 500 graph had 199 contigs with an N50 size of 62,865
bp and the order 100 graph had 648 contigs with an N50 size of 38,786
bp. Thus, in broad terms, the real assembly graph has similar characteristics to the perfect de Bruijn graphs in our experiments.
More relevant to our study is the question of whether landmark edges can be easily found in the Newbler graphs. The AGORA algorithm critically depends on our ability to find edges that have a unique placement along the optical map. According to this criterion, the Newbler contig graph is also roughly similar to the simulated graphs. Specifically we find 15 landmarks in the Newbler assembly, compared to 15 and 26 landmarks in the order 100 and 500 de Bruijn graphs, respectively. We also aligned the Newbler contigs using the more complex dynamic programming algorithm described in [12
] and identified 22 landmarks, indicating that the use of already existing optical map alignment algorithms will be effective in extending the AGORA algorithm to real sequencing data.