Our final assembly contains one large scaffold with 76 contigs whose total length is 6,290,005 bp, with the longest contig at 512,638 bp. The 436 unplaced contigs, which should fit into the remaining gaps, represent sequence that is unique to PAb1. Our annotation shows that most of these contigs contain genes that are homologous to genes in other Pseudomonas species. Several contigs contain bacteriophage genes, pointing to recent phage insertion events in PAb1. The final assembly thus consists of 512 contigs covering 6,706,902 bp, with 94% of the bases in a single large scaffold. Approximately 9% of the reads were not used in the assembly; many of these can be mapped to contigs if we use relaxed matching criteria, indicating that they represent low-quality data. Our annotation of the PAb1 genome identified 5,602 protein-coding genes, as compared to 5,568 for PAO1 and 5,892 for PA14.
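The summary figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are ours and purely illustrative.

```python
# Figures quoted in the paragraph above.
scaffold_contigs = 76
scaffold_bp = 6_290_005
unplaced_contigs = 436
total_bp = 6_706_902

# Derived quantities.
total_contigs = scaffold_contigs + unplaced_contigs  # 512 contigs overall
unplaced_bp = total_bp - scaffold_bp                 # bases in unplaced contigs
scaffold_fraction = scaffold_bp / total_bp           # rounds to 94%

print(f"{total_contigs} contigs, {unplaced_bp:,} bp unplaced, "
      f"{scaffold_fraction:.1%} of bases in the scaffold")
```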
All Solexa reads have been deposited in the Short Read Archive at NCBI, and the final genome sequence and annotation have been deposited in GenBank as accession ABKZ01000000.
We have demonstrated that it is possible to sequence and assemble a bacterial genome from deep sequencing using 33 bp reads. The final assembly has 40.3× coverage, with very high agreement among the individual reads at the vast majority of positions in the genome. To measure the accuracy of individual reads, we examined all positions in the assembly with >20× coverage, which yielded 5.9 million positions. If we count as errors any bases that disagree with the consensus at those positions, we get an estimate based on internal consistency that the error rate per read is 1.04%. Based on this estimate, the expected number of errors for regions of the genome with coverage of >20× is close to zero, except for systematic errors such as difficult-to-sequence regions. This illustrates how the great depth of sequencing possible with short-read technology produces higher quality assemblies—in regions with deep coverage—than would conventional Sanger sequencing at a typical 8× coverage depth.
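The internal-consistency estimate can be illustrated with a minimal sketch. This is not the code used for the analysis above. It assumes a hypothetical input in which each assembly position is represented by a string of the read bases aligned there, takes the most frequent base as the consensus, and treats read errors as independent. The second function gives an upper bound on the chance of a wrong consensus call at one position, using the fact that a wrong base can only win if at least half of the reads are in error.

```python
import math
from collections import Counter


def per_read_error_rate(columns, min_coverage=20):
    """Fraction of read bases that disagree with the consensus.

    columns: iterable of strings, one per assembly position, holding the
    read bases piled up there (a hypothetical format for this sketch).
    Only positions with coverage strictly above min_coverage are counted.
    """
    mismatches = 0
    total = 0
    for column in columns:
        depth = len(column)
        if depth <= min_coverage:
            continue
        # Consensus = most frequent base in the column.
        consensus_count = Counter(column).most_common(1)[0][1]
        mismatches += depth - consensus_count
        total += depth
    return mismatches / total if total else float("nan")


def consensus_error_bound(depth, p):
    """Upper bound on P(wrong consensus) at one position.

    Assumes independent errors with per-base rate p; the consensus can
    only be wrong if at least half of the `depth` reads are wrong.
    """
    k_min = math.ceil(depth / 2)
    return sum(
        math.comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


# Toy pileup: the 10x column is skipped because it is not above 20x.
toy_columns = ["A" * 24 + "C", "G" * 30, "T" * 10]
toy_rate = per_read_error_rate(toy_columns)  # 1 mismatch in 55 bases

# Figures from the text: 1.04% error rate, 5.9 million positions, >20x.
bound = consensus_error_bound(21, 0.0104)
expected_errors = 5.9e6 * bound
print(toy_rate, bound, expected_errors)
```

Under these assumptions the bound at the minimum depth of 21 is vanishingly small, which is consistent with the statement that random errors contribute essentially nothing in deeply covered regions; systematic errors are not captured by this model.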
We evaluated the coverage to determine if the Solexa sequences were biased towards any portion of the genome, and found a small bias towards high-GC regions, which comprise most of the genome. In particular, regions with 60–70% GC, which comprised 79% of the genome, had 40× coverage. In contrast, regions with 50–55% GC (1.5% of the genome) had 14× coverage, and regions with <50% GC (1.1% of the genome) had just 5× coverage.
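A coverage-by-GC summary of this kind can be sketched as follows. The window size, the bin edges, and the input layout (a sequence string plus a per-base depth list) are assumptions made for illustration, not the parameters used in the analysis above.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq) if seq else 0.0


def coverage_by_gc(sequence, depth, window, edges):
    """Summarise coverage per GC bin over non-overlapping windows.

    sequence: assembled sequence as a string.
    depth:    per-base read depth, same length as sequence.
    window:   window size in bases (hypothetical choice).
    edges:    increasing bin edges; each bin is lo <= GC < hi.
    Returns {(lo, hi): (fraction_of_windows, mean_depth)} for non-empty bins.
    """
    bins = {(lo, hi): [0, 0.0] for lo, hi in zip(edges, edges[1:])}
    n_windows = 0
    for start in range(0, len(sequence) - window + 1, window):
        gc = gc_fraction(sequence[start:start + window])
        mean_depth = sum(depth[start:start + window]) / window
        for (lo, hi), acc in bins.items():
            if lo <= gc < hi:
                acc[0] += 1
                acc[1] += mean_depth
                break
        n_windows += 1
    return {
        key: (count / n_windows, total / count)
        for key, (count, total) in bins.items()
        if count
    }


# Toy input: one GC-rich window at 40x and one AT-rich window at 5x.
toy_seq = "GC" * 5 + "AT" * 5
toy_depth = [40] * 10 + [5] * 10
summary = coverage_by_gc(toy_seq, toy_depth, window=10, edges=(0.0, 0.5, 1.01))
print(summary)
```

The last edge is set slightly above 1.0 so that a window with 100% GC still falls inside the top bin.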
The alignment of P. aeruginosa PAb1 to strain PA14, which matches at 99.4% identity for >90% of the genome, can be used to provide an estimate of the sequencing accuracy. To assess whether differences between our assembly and the PA14 genome represented true differences or sequencing errors, we aligned the two genomes and identified all single nucleotide polymorphisms (SNPs). Out of 5,568,550 aligned bases from the longer PAb1 contigs, 5,537,508 agreed with PA14 and can be presumed correct. For each of the remaining 31,042 SNPs, we examined all reads that were assembled at that point and assessed whether (a) the depth of coverage was adequate, and (b) the PAb1 reads agreed on the consensus base. The coverage was 10-fold or greater for 95% of these SNPs. Using the conservative assumption that a SNP might be in error if the inter-read agreement was less than 80%, we found 1,157 positions (out of 5,568,550) that might be sequencing errors. We also found 1,104 insertions and deletions (indels) in the aligned regions, and our assembled reads were in perfect agreement for 917 of these. If we assume conservatively that the other 187 indels are errors, then considering both SNPs and indels, the accuracy of the assembled genome is greater than 99.97%.
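The accuracy bound follows directly from the counts given above, as the short sketch below shows. The `is_suspect_snp` helper is a hypothetical illustration of the 80% inter-read agreement rule; it assumes each SNP position is given as a string of the read bases assembled there, and is not the code used in the analysis.

```python
from collections import Counter


def is_suspect_snp(column, min_agreement=0.8):
    """Flag a SNP whose reads agree on the consensus base less than 80% of the time."""
    consensus_count = Counter(column).most_common(1)[0][1]
    return consensus_count / len(column) < min_agreement


# Counts quoted in the text.
aligned_bases = 5_568_550
agreeing_bases = 5_537_508
suspect_snps = 1_157
indels = 1_104
perfect_indels = 917

snps = aligned_bases - agreeing_bases     # 31,042 SNPs
suspect_indels = indels - perfect_indels  # 187 indels assumed to be errors
accuracy = 1 - (suspect_snps + suspect_indels) / aligned_bases

print(f"{snps:,} SNPs, {suspect_indels} suspect indels, accuracy {accuracy:.4%}")
```

For example, `is_suspect_snp("AAAAAAAAAC")` is false (90% agreement), whereas `is_suspect_snp("AAAAAAACCC")` is true (70% agreement).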
The assembly is sufficiently complete that we can confidently infer that genes are missing if their expected positions fall in the midst of contigs. Although deeper analysis will be presented in a follow-up paper, we note that the PAb1 strain is known for its hypermotility on low-percentage agar media. Our sequence contains most of the genes required for swimming motility in P. aeruginosa [26], but is missing part of the pathway used by cyclic-di-GMP, a secondary signaling molecule that has been implicated in repressing swimming motility [27], [28]. By searching for all of the known P. aeruginosa genes in this pathway [29], [30], [31], we found that three genes encoding diguanylate cyclase and phosphodiesterase are missing: PA2771 and PA2818 (arr) from the PAO1 strain, and PA14_59790 (pvrR) from the PA14 strain [32], [33]. All three of these genes are located in chromosomal regions previously indicated as hyper-variable based on genomic hybridizations [29]. The altered gene content of PAb1 in the regulatory pathways repressing flagella may contribute to its observed hypermotility.
The new algorithm described here makes it possible for any scientist to acquire the entire genome of a bacterium at high speed and very low cost. One limitation of our method is that it depends on the existence of related genomes (for the comparative assembly step) and protein sequences (for the gene boosting step). However, GenBank already contains the complete genome sequences for >650 microbial genomes, and draft sequences for nearly 1000 more. For many of these species, much larger numbers of related strains and species have yet to be sequenced. Our method opens the door to the use of whole-genome sequencing to study entire collections of bacteria, to rapidly identify genotypes from mutagenized genetic screens, and for other analyses that were previously too costly or technically infeasible. The gene-boosted assembly technique applies equally well to both short- and long-read sequencing methods, and should also work for assembling the gene-containing regions of much larger genomes.