Using several assembly quality metrics, the critical question we wished to address was how de novo NGS assemblies compare to the Gallus_gallus-2.1 reference [14], an assembly based on well-established Sanger data. Our results validate previous reports [1] that the assembly of large (>1 Gbp) vertebrate genomes is possible using both 454 and Illumina data. The NGS assemblies discussed herein represent advancements in our ability to assemble and analyze large genomes using NGS, further diminishing the need to rely solely on Sanger sequencing in de novo genome projects and presenting an opportunity to explore hybrid assemblies that utilize reads from multiple sequencing platforms, especially for existing low-coverage Sanger projects.
In spite of ongoing debate on what should be the genome assembly standard in the era of NGS [20], it is encouraging that our assemblies and others derived from NGS are progressing to higher levels of contiguity and quality, and show promise in identifying novel sequences. This study is the first report to measure changes in single base substitution, insertion, and deletion rates as well as contig order and orientation among NGS assemblies derived from the same DNA source as a published reference. The advantage of this approach is that we can be confident mis-assembly calls are not due to structural variation between individuals. Using discordant paired-end mapping and contig alignment methods, we conclude that the reference is of higher quality than either NGS assembly. Overall, our estimates of the rate of mis-assembly events within NGS assemblies, as compared to the reference assembly, show an advantage to the lower-cost Illumina/SOAP assembly (Table ). In practice, repeat element expansion and organization in the genomes of other, more complex species will determine if comparable assembly accuracy is achievable. Importantly, even Sanger-based draft assemblies do not completely and accurately represent segmental duplications, but this is a much greater problem in NGS assemblies [21].
It is generally accepted that the 454 sequencing method has a diminished ability to accurately measure homopolymer base stretches compared to other platforms. This manifested in our analysis as a higher deletion rate in the 454/Newbler assembly than in the Illumina/SOAP assembly and the reference assembled with PCAP, despite the optimization of Newbler to handle this error model by considering flowgrams. That said, Newbler showed a lower deletion rate than the PCAP assembler when applied to the same 454 data set, most likely because PCAP does not consider flowgrams (Table S3 in Additional file 1). Interestingly, assembling a combination of Sanger and 454 reads with CABOG [13], which was optimized for assembling hybrid data, effectively lowers the deletion rate (Table S3 in Additional file 1). Other post-assembly manipulation methods can also be utilized to correct deletion or insertion errors, regardless of read type [22].
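To make the rate comparison concrete, the following is a minimal sketch of how substitution, insertion, and deletion events might be tallied from a gapped pairwise alignment of an assembly contig against the reference. It is not the pipeline used in this study; the function names and the per-kilobase normalization are assumptions chosen for illustration.

```python
from collections import Counter


def classify_alignment_columns(ref_aln: str, asm_aln: str) -> Counter:
    """Tally column types in a gapped pairwise alignment.

    Hypothetical helper: a '-' in the assembly row is counted as a deletion
    (base present in the reference, absent from the assembly); a '-' in the
    reference row is counted as an insertion.
    """
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned rows must have equal length")
    counts = Counter()
    for r, a in zip(ref_aln.upper(), asm_aln.upper()):
        if r == "-" and a == "-":
            continue  # gap-only column carries no information
        if r == "-":
            counts["insertion"] += 1
        elif a == "-":
            counts["deletion"] += 1
        elif r == a:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def rates_per_kb(counts: Counter) -> dict:
    """Express each event type per 1,000 aligned reference bases."""
    ref_bases = counts["match"] + counts["substitution"] + counts["deletion"]
    if ref_bases == 0:
        raise ValueError("no reference bases in alignment")
    return {
        kind: 1000.0 * counts[kind] / ref_bases
        for kind in ("substitution", "insertion", "deletion")
    }


# Toy alignment: one homopolymer (T-run) deletion, one insertion, one substitution.
counts = classify_alignment_columns("ACGTTTTTACG-TA", "ACGTTTT-ACGATC")
rates = rates_per_kb(counts)
```

In the toy alignment the deletion falls inside a run of T bases, which is the kind of homopolymer under-call described above.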
The discovery of novel sequence not found in the current chicken reference assembly was another important goal of these NGS assembly experiments. The chicken genome is rich with high GC microchromosomes that are typically underrepresented by whole-genome Sanger sequencing approaches compared to the macrochromosomes [14]. These high GC regions are also known to be gene rich; thus, their under-representation is a possible culprit for initial low gene number estimates [14]. An important question, then, is whether NGS can be used to recover these GC-rich regions and other sequences not captured in existing Sanger-based draft assemblies. In this study, NGS assemblies uncovered a total of 31 Mbp of non-reference sequence with a high average GC content (54.2%) compared to the autosomal average (41.6%). It appears NGS can be a useful means to capture missing sequences in draft assemblies that were built using Sanger data. The 454 platform has also been shown to be effective in the recovery of sequences from microbial genomes with high GC content (>60% GC) [23] and in closing gaps in the human genome [24]. Furthermore, a protist genome project (Leishmania donovani) utilized Illumina data to close 46% of the gaps in a 454-based assembly, showing that hybrid approaches can effectively leverage the strengths of each platform [25].
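As an illustration of the GC comparison above, a minimal sketch of a length-weighted GC calculation over a set of sequences is given below. The function names are hypothetical, and excluding ambiguous bases (such as N) from the denominator is an assumption rather than a description of the method used here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def pooled_gc_fraction(seqs) -> float:
    """Length-weighted GC fraction across many sequences.

    Pooling the counts (rather than averaging per-sequence fractions)
    keeps short contigs from dominating the estimate.
    """
    gc = at = 0
    for seq in seqs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases in input")
    return gc / (gc + at)


# Toy contigs standing in for non-reference sequence.
novel_gc = pooled_gc_fraction(["GGCC", "AT", "GCNNAT"])
```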
In terms of gene representation, we observed approximately 93% coverage of the Ensembl gene set in both NGS assemblies, similar to the 89% of RefSeq genes covered by an all-Illumina assembly of the human genome [1]. This number does not express, however, whether gene footprints are represented contiguously, and we found evidence of high gene fragmentation in NGS assemblies when we reduced our alignment length thresholds. In support of these findings, only 70% of known human genes were found to be in one scaffold of a human sample assembled from all Illumina reads, suggesting extensive disruptions in gene contiguity [21]. Clearly, there is an increasing need for robust gene modeling algorithms that can take such fragmentation into account. Additionally, the difficulty of chromosomal assignment, ordering, and orienting of NGS contigs and supercontigs increases in parallel with fragmentation.
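The distinction between a gene being covered and a gene being represented contiguously can be sketched as follows. This is an illustrative calculation only: the input layout (gene lengths plus a list of gene, scaffold, aligned-bases hits), the 0.9 threshold, and the assumption that aligned segments of a gene do not overlap are all hypothetical.

```python
from collections import defaultdict


def summarize_gene_representation(gene_lengths, hits, min_fraction=0.9):
    """Count genes covered overall versus covered within a single scaffold.

    gene_lengths: dict mapping gene_id -> gene length in bases
    hits: iterable of (gene_id, scaffold_id, aligned_bases) tuples,
          assumed to describe non-overlapping aligned segments
    """
    per_scaffold = defaultdict(lambda: defaultdict(int))
    for gene_id, scaffold_id, aligned_bases in hits:
        per_scaffold[gene_id][scaffold_id] += aligned_bases

    covered = single_scaffold = 0
    for gene_id, length in gene_lengths.items():
        by_scaffold = per_scaffold.get(gene_id, {})
        total = sum(by_scaffold.values())
        best = max(by_scaffold.values(), default=0)
        if total / length >= min_fraction:
            covered += 1  # enough of the gene is present somewhere
        if best / length >= min_fraction:
            single_scaffold += 1  # one scaffold alone carries the gene
    return {
        "total": len(gene_lengths),
        "covered": covered,
        "single_scaffold": single_scaffold,
    }


# Toy data: g1 is intact, g2 is split across two scaffolds, g3 is mostly missing.
summary = summarize_gene_representation(
    {"g1": 1000, "g2": 1000, "g3": 1000},
    [("g1", "s1", 950), ("g2", "s1", 500), ("g2", "s2", 450), ("g3", "s3", 300)],
)
```

In the toy data two of three genes count as covered but only one lies within a single scaffold, mirroring the gap between coverage and contiguity discussed above.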
While the low repetitive content (approximately 10%) of the chicken genome [14] limits the direct modeling of assembly quality expectations for genomes with higher repeat complexity, such as mammals, there are several analyses that can be performed equally well on NGS and Sanger-based assemblies. For example, non-coding RNA transcripts shorter than typical NGS read and contig lengths can be readily annotated from known non-coding RNAs. However, limitations are also encountered when using these NGS assemblies. A summary of the assembly algorithm barriers and the resulting limitations has been presented elsewhere [21]. One example is the inability to detect the distribution of segmental duplications within the genome, considered crucibles of gene birth [21].
The intermediate-sized chicken genome (1.2 Gbp) serves as a good starting point to test and optimize algorithms prior to assembling mammalian genomes. For microbial genomes, short insert libraries are sufficient to produce high quality assemblies. When considering larger genomes, longer reads and libraries with larger insert sizes are necessary to span longer repeats. As this paper was under review, Gnerre et al
] successfully assembled mammalian genomes with greatly improved coverage and accuracy using ALLPATHS-LG. These assemblies cover about 40% of segmental duplication content, compared to about 12% in SOAP assemblies. Although the ALLPATHS-LG algorithm requires specialized libraries to assemble mammalian genomes, including long fragment, short jump, long jump, and fosmid jump libraries at high coverage, and a minimum of 90-fold coverage, we are eager to test its effectiveness on a range of complex genomes.
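To give a sense of scale for a 90-fold coverage requirement, the sketch below applies the standard relation coverage = (number of reads × read length) / genome size. The 100 bp read length is an assumption made for illustration, and the function name is hypothetical; the 1.2 Gbp genome size is the chicken figure quoted above.

```python
def reads_needed(genome_size_bp: int, read_length_bp: int, fold_coverage: int) -> int:
    """Smallest number of reads giving at least the requested fold coverage.

    Uses coverage = reads * read_length / genome_size and rounds up
    with integer arithmetic to avoid floating-point error.
    """
    if genome_size_bp <= 0 or read_length_bp <= 0 or fold_coverage < 0:
        raise ValueError("sizes must be positive and coverage non-negative")
    total_bases = fold_coverage * genome_size_bp
    return -(-total_bases // read_length_bp)  # ceiling division


# 90-fold coverage of a 1.2 Gbp genome with (assumed) 100 bp reads.
n_reads = reads_needed(1_200_000_000, 100, 90)
```

Under these assumptions roughly 1.08 billion reads would be required, before accounting for reads lost to quality filtering or uneven library representation.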
The cost advantage of NGS [4] has already pushed whole genome sequencing budgets into a more acceptable range for numerous funding agencies, prompting an international consortium of scientists to propose sequencing 10,000 vertebrate species [27]. With the promise of even longer read lengths from evolving sequencing technology, our ability to create nearly complete genome sequences, even navigating repeat structures that have been resistant to all types of assembly methodology, is moving forward. Efforts to optimize this approach are underway in our lab and many others with the goal of increasing the utility of de novo assemblies in comparative and experimental studies.