We sequenced the genomes of three bacteria (Staphylococcus aureus, Escherichia coli, and Rhodobacter sphaeroides) and two fungi (Schizosaccharomyces pombe and Neurospora crassa) using the Illumina platform (Tables , , , and ; Table S3 in Additional data file 1). In all cases finished reference sequences were available. However, we found what appeared to be biological differences between our isolates and the reference sequences. There were only 11 differences in total for the first two bacteria, but 374 for R. sphaeroides. To provide a control data set for precisely evaluating our bacterial assemblies, we corrected the bacterial reference sequences using data from Illumina, then carefully validated the corrections using data from 454 and Sanger chemistries (Additional data file 1). For the fungi, we did not attempt to reconcile base-level differences, and thus, as compared to our samples, the accuracy of the reference sequences is lower.
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: source data
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: contiguity
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: genome coverage
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: correctness (of chunks approximately 10 kb or less)
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: base accuracy, misassemblies, and long-range validity
The data for the assemblies were of three types: paired 36-base reads [5] derived from approximately 200-bp fragments, paired 26-base reads derived via a 'jumping' construction from approximately 4,000-bp fragments, and for one genome, additional unpaired 36-base reads. ALLPATHS requires, at a minimum, data from two paired libraries (ideally one with short distance spacing and one long). For the jumping library reads, we used a method (Additional data file 1) that we adapted from a protocol developed for the SOLiD platform [3]. This method first forms circles from the long fragments, placing a stuffer sequence at the junction site, then digests out the junction fragments using the EcoP15I restriction enzyme, which has fixed sites in the stuffer. Compared with protocols that work by randomly shearing the circles, this method has the advantage of yielding reads of known length and the disadvantage that the usable read length is only 26 bases.
The data were assembled using version 2 of ALLPATHS [10], using default arguments for all assemblies. For version 2, we introduced modifications that were essential for accurate handling of real data (Materials and methods). Conceptually, most of the modifications address the fact that genome coverage and coverage quality are sensitive to sequence context. One example is to determine whether a K-mer in the reads should be trusted as correct. Whereas the previous ALLPATHS version simply counted its number of copies in the reads, the new version instead uses quality scores to make the determination.
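The contrast between count-based and quality-based K-mer trust can be illustrated with a minimal sketch. This is not the ALLPATHS implementation: the function names, the thresholds (`min_copies`, `min_qual`, `min_good_copies`) and the rule "every base in the K-mer must meet a minimum quality" are assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def kmers(read, k):
    """Yield every K-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def trusted_by_count(reads, k, min_copies=2):
    """Count-based rule: trust a K-mer seen at least min_copies times.

    min_copies is a hypothetical threshold, not a value from the paper.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_copies}


def trusted_by_quality(reads_with_quals, k, min_qual=20, min_good_copies=1):
    """Quality-based rule (assumed form): an instance of a K-mer counts
    only if all of its bases have quality >= min_qual; the K-mer is
    trusted once it has min_good_copies such instances.
    """
    good = Counter()
    for read, quals in reads_with_quals:
        for i in range(len(read) - k + 1):
            if min(quals[i:i + k]) >= min_qual:
                good[read[i:i + k]] += 1
    return {km for km, c in good.items() if c >= min_good_copies}
```

Under these assumed rules, a K-mer that occurs only once but with uniformly high quality can be trusted, whereas a pure count threshold would discard it in a region of low coverage.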
An ALLPATHS assembly is a graph, as shown in Figure 1 (S. aureus). A graph representation of an assembly consists of edges, representing contiguous and unambiguous sequences of bases, and vertices, representing junction points between edges, and thus branch points. Two edges exiting (or entering) a vertex define an ambiguity, which allows alternative reconstructions of an assembly region. Ambiguities typically arise from unresolved repeat sequences or heterozygous single nucleotide polymorphisms. Ideally, at least one of the reconstructions is correct.
Figure 1 The ALLPATHS assembly of S. aureus. Each edge represents a contiguous and unambiguous sequence of bases and, for this assembly, each component is its own scaffold. Longer edges are in red, short edges in gray. The sizes of the gray edges and regions are ...
The assembly graph may be divided into connected components, between which there are no edges. Using paired reads we may form scaffolds, which are linked sequences of one or more such components, separated by gaps. As part of the output of ALLPATHS we convert these graph scaffolds into traditional, linear scaffolds, which are presented via a fasta file with Ns for gaps. This standard output makes the data compatible with existing analytical tools. In these linear scaffolds, ambiguities (unresolved regions of the assembly graph) are replaced by gaps, entailing some loss of information. For example, the 36-bp repeat of Figure is represented by a gap in the scaffold file, even though we know exactly what sequence is present in the gap, just not the exact number of copies. In contrast, for a true gap we have no knowledge of the missing sequence. Moreover, by treating all ambiguities as gaps we under-represent the true contiguity of the assembly.
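The loss of information during linearization can be sketched as follows. The classes `Edge`, `Ambiguity` and `Gap`, and the choice to pad an ambiguity with as many Ns as its longest alternative, are hypothetical; the text states only that ambiguities and gaps both become Ns in the fasta output.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Edge:
    bases: str  # contiguous, unambiguous sequence


@dataclass
class Ambiguity:
    alternatives: List[str]  # parallel edges between the same two vertices


@dataclass
class Gap:
    size: int  # estimated gap size in bases (hypothetical field)


Element = Union[Edge, Ambiguity, Gap]


def linearize(elements: List[Element]) -> str:
    """Flatten a scaffold into one string, writing Ns for gaps and ambiguities."""
    parts = []
    for el in elements:
        if isinstance(el, Edge):
            parts.append(el.bases)
        elif isinstance(el, Ambiguity):
            # Assumption: pad with the length of the longest alternative.
            parts.append("N" * max(len(a) for a in el.alternatives))
        else:
            parts.append("N" * el.size)
    return "".join(parts)


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one scaffold as a fasta record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```

After linearization an `Ambiguity` and a `Gap` are indistinguishable, which is exactly the information loss described above: the alternatives are known in the graph but absent from the fasta file.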
Tables , , and quantify key properties of each assembly, including contiguity - how connected it is - and completeness - how much of the genome it covers. We also assessed the quality of the assemblies in several different ways: correctness, base accuracy, misassembly rate, and long range validity.
We devised an objective method to evaluate correctness of assemblies as follows. First, we broke each contig into approximately 10-kb chunks. (More precisely, each contig of size n > 10 kb was broken into equal-sized chunks, ± 1 base; each contig of size ≤10 kb was treated as a single chunk.) We then found the 'best' alignment of each chunk to the reference sequence, meaning the alignment that had the smallest total number of substitution and indel bases. We only considered alignments that subsumed a 100 base perfect match to the reference. (The special case where there was no alignment is described below.) From the best alignment we inferred the error rate of the chunk, understanding that these errors could either be errors in the assembly or in the reference sequence. Then we divided the chunks into six classes by their error rate. Class I contains the perfect chunks, class II chunks have error rates up to 0.1%, class III chunks have error rates up to 1%, class IV chunks have error rates up to 10%, and class V chunks have error rates of at least 10%. Class VI consists of chunks that appear not to match the reference at all, presumably due to either contamination of the sample or omissions from the reference, although very short and inaccurate chunks could also be in this category. We report the fraction of assembly bases in each class.
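The chunking and classification procedure can be sketched as below. Two details are assumptions rather than facts from the text: the number of chunks per contig is taken to be the integer part of n / 10,000, and the class boundaries are treated as strict upper limits (for example, class III means an error rate below 1%), which is consistent with the later statement that classes IV and V have error rates of 1% or higher. The alignment step itself is not shown; each chunk is represented by its length, its error count, and whether it aligned at all.

```python
def split_into_chunks(n, target=10_000):
    """Return chunk lengths for a contig of n bases.

    Contigs of at most `target` bases form a single chunk. Longer
    contigs are cut into k near-equal chunks differing by at most
    one base. k = n // target is an assumption.
    """
    if n <= target:
        return [n]
    k = n // target
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)


def classify(errors, length, aligned=True):
    """Assign a chunk to class I..VI from its best-alignment error count."""
    if not aligned:
        return "VI"  # no acceptable alignment to the reference
    if errors == 0:
        return "I"   # perfect chunk
    rate = errors / length
    if rate < 0.001:
        return "II"
    if rate < 0.01:
        return "III"
    if rate < 0.10:
        return "IV"
    return "V"


def class_fractions(chunks):
    """Fraction of assembly bases per class.

    `chunks` is an iterable of (length, errors, aligned) tuples.
    """
    totals = {c: 0 for c in ("I", "II", "III", "IV", "V", "VI")}
    for length, errors, aligned in chunks:
        totals[classify(errors, length, aligned)] += length
    total = sum(totals.values())
    return {c: (b / total if total else 0.0) for c, b in totals.items()}
```

Weighting by chunk length rather than by chunk count matches the statement that the fraction of assembly bases in each class is reported.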
We assayed the 'base accuracy' of the assemblies by considering the error rate for bases in chunk classes I through III (error rate <1%). This translates to a PHRED-like quality score [17].
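For reference, the conversion from an error rate to a PHRED-like score follows the standard definition Q = -10 log10(p). The sketch below applies it to an error count and a base count; the function name and the treatment of zero errors as infinite quality are illustrative choices.

```python
import math


def phred_like_quality(errors, bases):
    """Return Q = -10 * log10(errors / bases).

    With no observed errors the score is unbounded, so infinity is
    returned (an illustrative convention).
    """
    if errors == 0:
        return math.inf
    return -10 * math.log10(errors / bases)
```

Under this definition, one error per million bases corresponds to Q60 and one error per ten thousand bases to Q40.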
The term 'misassembly' is typically used to refer to a relatively large defect, such as a rearrangement, but can also refer to a significant insertion, deletion or inversion. One way to precisely encapsulate this notion of large defect would be to declare a chunk misassembled if it has a sufficiently high error rate. We took this approach, declaring that a chunk was misassembled if it had an error rate of 1% or higher (that is, it was in class IV or V), reasoning that, given the depth of coverage and high accuracy of the sequencing technology, such a high error rate would most likely be the result of one or more large defects, rather than of isolated sequencing errors. We define the misassembly rate to be the fraction of bases in misassembled chunks. We note that all errors in chunks were accounted for either via base accuracy or misassembly; we took the division point to be at a 1% error rate.
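The misassembly rate defined above reduces to a simple weighted fraction. In this sketch each chunk is a hypothetical (length, errors) pair; taking the denominator to be all chunks passed to the function is an assumption, since the text does not say how unaligned chunks enter the total.

```python
def misassembly_rate(chunks, threshold=0.01):
    """Fraction of bases lying in chunks whose error rate is >= threshold.

    `chunks` is a list of (length, errors) pairs for aligned chunks.
    """
    total = sum(length for length, _ in chunks)
    bad = sum(length for length, errors in chunks
              if errors / length >= threshold)
    return bad / total if total else 0.0
```

For example, two 10-kb chunks with 0 and 3 errors plus one 5-kb chunk with 200 errors (4%) give a misassembly rate of 5,000 / 25,000 = 20%.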
Long range validity
We assayed the long-range validity of the assemblies by computing the probability that sequences at distance 100 kb in scaffolds are, in fact, at about that distance in the genome (Table ).
The ALLPATHS bacterial assemblies are highly contiguous, with N50 contig sizes ranging from 156 to 477 kb, and N50 scaffold sizes from 611 to 2,680 kb. The assemblies are nearly complete: coverage ranges from 98.5% to 99.3%. By all measures, the assemblies are highly accurate. The fraction of chunks (approximately 10 kb) that are perfect ranges from 99.3% to 99.8%. The inferred base accuracy is approximately Q60, that is, about one error in 10^6 bases. The long-range validity of the assemblies is perfect.
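The N50 statistic used here is not defined in the text. The sketch below uses the common definition: the largest length L such that contigs (or scaffolds) of length at least L contain at least half of the total bases. Whether the paper measures the total against the assembly or the genome is not stated, so the assembly total is assumed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

For lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest contigs cover 150 bases, which exceeds half the total, so the N50 is 70.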
ALLPATHS is designed to present alternatives (graph branches) in cases where the exact sequence of the assembly cannot be determined. For example, as shown in Figure , component 2 of the assembly of S. aureus has two parallel edges representing a 45-base region, one of which matches the reference sequence perfectly, and the other of which has four substitutions. This ambiguity is distinguished from an error, where the assembly presents only the wrong sequence. The corrected reference sequences for the bacterial assemblies are nearly perfect, so we were able to generate a complete catalog of all errors in the ALLPATHS bacterial assemblies. There are only eight in a total of 12.1 Mb, as follows. The S. aureus assembly has exactly four errors: two single base mismatches separated by 11 bases, and two others separated by 23 bases. These events occur in repeat sequences of length >500 bp and >99% identity. The E. coli assembly has exactly one error: a single base mismatch. This event occurs in a perfectly repeated sequence of length 80 bp. The R. sphaeroides assembly has three errors. First, a 234 base deletion from the assembly, adjacent to a repeat. Second, a 10-kb component contains a 6.4-kb edge that matches a plasmid perfectly, but also a 3.4-kb edge that is misassembled and consists of repeated sequence. Third, a 160-kb component joins similar sequence between two plasmids, although all edges in the component match the reference sequence perfectly. This defect does not appear at all in the linearized scaffold.
For the ALLPATHS fungal genome assemblies, contiguity and completeness were lower than for the bacterial genomes. Thus, the N50 contig size for S. pombe was 51 kb, and for N. crassa, 19 kb. A lower fraction of the genome was represented: genome coverage was 95.9% for S. pombe and 89.5% for N. crassa. These assemblies were accurate, although not as accurate as the bacterial assemblies. Base quality was computed to be about Q40, although this is a floor estimate, since errors in the reference sequences or biological differences would have been reported as assembly errors. Long-range validity is very good: the odds of being correctly connected at distance 100 kb in a scaffold are about 99.8%.
Several factors can limit both the contiguity of assemblies and the fraction of the genome represented, including repeat sequence, extremes of GC composition, and non-equimolar sequences such as plasmids. We consider how these factors applied to two of the genomes: R. sphaeroides and N. crassa.
For R. sphaeroides, 98.5% of the genome was present in the assembly; the missing regions consisted mostly of repetitive sequences. In our experience with the Illumina sequencing method employed here, representation decreases significantly for sequences of higher GC composition (Figure S1 in Additional data file 1), such as this genome (69% GC). We therefore sequenced very deeply to compensate for reduced coverage in GC-rich parts. The genome contains plasmids with two additional characteristics that challenged assembly: they are present at higher molarity than the chromosomes and contain long near-perfect repeat sequences.
For N. crassa, 89.5% of the genome was present in the assembly. The missing 10.5% is enriched in repetitive sequences and regions of very low GC composition.
In summary, for bacteria, the ALLPATHS assemblies were markedly complete, contiguous and accurate. The observed base accuracy of less than one error per million rivaled that of finished sequence [18]. Indeed, there were only five edges with any errors in the three bacterial assemblies. These assemblies were better than the accepted community draft standard. For the fungi, and especially N. crassa, the assemblies were accurate, but at a lower level of completeness and contiguity.
To understand how the ALLPATHS assemblies would compare to assemblies produced by existing software, we also assembled the identical datasets with Velvet [12] and EULER-SR [9], using standardized arguments for each assembler applied to all five genomes. In each case, we initially tested a range of arguments, with the goal of finding a single choice of settings that would optimize assembly quality. We note, however, that some choices optimized contiguity at the expense of accuracy whereas other choices did the reverse. For Velvet and EULER-SR, we arrived at a single formula for each that was used in all assemblies presented here (Additional data file 1).
We first compare the results of ALLPATHS and Velvet (see Tables , , and for details). For S. aureus and E. coli, the ALLPATHS contigs were about five times longer than the Velvet contigs, whereas for the other three species, contig lengths were comparable. Scaffolds were longer for ALLPATHS for two species, and longer for Velvet for the other three; however, the ALLPATHS scaffolds were far more accurate. The odds of being correctly connected at distance 100 kb in an ALLPATHS scaffold were 100%, 100%, 100%, 99.8%, or 99.8%, depending on the species, whereas the same odds for Velvet were 45.6%, 86.4%, 75.8%, 80.2%, or 13.6%. In all five cases, the ALLPATHS assemblies were somewhat more complete. The base accuracy of the ALLPATHS assemblies was higher than for Velvet. For the bacterial genomes, where the reference sequences were essentially base perfect, thus enabling highly accurate measurement, we found that the ALLPATHS base quality was approximately Q60, whereas the Velvet base quality was approximately Q34. The reported base accuracies of the fungal assemblies were also higher for ALLPATHS. Finally, misassemblies were also much less frequent in the ALLPATHS assemblies: the misassembly rates for the ALLPATHS bacterial assemblies were 0%, 0%, and 0.3%, whereas those for the Velvet bacterial assemblies were 6.6%, 6.9%, and 3.0%. The ALLPATHS fungal assemblies also had about six-fold fewer misassemblies.
We also report results for EULER-SR assemblies (see Tables , , and for details). We note that EULER-SR scaffolding was minimal, yielding scaffolds of about the same size as contigs. In all cases EULER-SR contigs were shorter than either ALLPATHS or Velvet contigs. The EULER-SR assemblies were also substantially less accurate, in terms of both base accuracy and misassembly rate, than those produced by ALLPATHS and Velvet.
Finally, we carried out a series of 50 assemblies with the goal of understanding whether the results of Velvet or EULER-SR might be improved by using only the reads from approximately 200-bp fragments, thus excluding the shorter 26-base reads from long ('jumped') fragments, which might hinder the performance of the algorithms. For each of the two programs we tried two versions of the code, as well as multiple values of K: for Velvet we tried K = 25, 28, and 31, whereas for EULER-SR we used K = 25 and 28, the maximum allowed value. However, the EULER-SR assemblies with K = 28 terminated prematurely so we were unable to include the results.
Table S5 in Additional data file 2 displays the results of these auxiliary 'jump-free' assemblies. In brief, we note the following. Not surprisingly, scaffolds are much shorter for the auxiliary assemblies. For example, for S. aureus, the Velvet scaffolds are six-fold (or more) shorter than those in the assembly that uses all of the data. In some other ways, some of the assemblies from less data were better. For example, the auxiliary assembly of R. sphaeroides obtained from the older version of Velvet using K = 28 yielded contigs whose N50 size was 19% less than for the assembly of all the data, but which covered 2% more of the genome and were more accurate (0.5% misassembled versus 3%).