Dramatic advances in sequencing technologies have opened new possibilities for whole-genome analysis. The increasing read length of next-generation sequencing platforms, together with the promising prospects of third-generation sequencing platforms, will inevitably lead to better assemblies that represent genomes as larger contiguous stretches of DNA. In addition, third-generation technologies (such as the PacBio and IonTorrent systems) will be capable of producing sequencing reads with large undefined inserts, thus providing valuable paired-read information for the assembly and scaffolding process. At the same time, the development of genome closure software deserves attention, so that difficult genomic regions that cannot be covered by draft assemblies can be resolved.
Our results with GapFiller indicate that gapped genomic regions can be reliably closed through an automatic protocol that uses only short sequencing reads. Costly Sanger sequencing can therefore be limited to a few difficult repeat areas. We also show that the method is suited to both bacterial and (large) eukaryotic datasets in terms of accuracy, time and memory usage. For the human dataset, we underscore that gapped regions may contain diverse but crucial functional information that is missed in the draft assembly.
From an in-depth investigation of the regions closed by GapFiller and SOAPdenovo, we argue that our GapFiller method has three main advantages. First, it takes into account the estimated gap size prior to closure, thus discarding erroneous closures that are shorter or longer than expected. Second, it does not try to fill a gap through local assembly of reads into contigs, but rather seeks to extend the contigs from each end through k-mer overlap. This second point is especially important for overcoming short tandem repeats. Third, it takes into account the contig edges, which are often a source of misassemblies, and re-evaluates them during gap closure. Importantly, GapFiller requires only limited computational resources and is thus also suited to (larger) eukaryotic genomes.
SOAPdenovo also performs well in terms of speed and memory usage and appears to be a valuable alternative to our strategy. From the results obtained on both bacterial and eukaryotic datasets, it appears that SOAPdenovo is better able to close larger gaps. One explanation might be that GapFiller implements a more conservative strategy than SOAPdenovo. Given that larger gaps usually represent more complex regions than smaller gaps, GapFiller may simply be more careful in avoiding dubious closures. This is in line with the observation that GapFiller generally yields more accurate results than SOAPdenovo. Alternatively, the lower effectiveness of GapFiller on larger gaps could follow from the first advantage mentioned above: if the estimated gap size in the scaffold does not match the size of the closed fragment, the closure is rejected. From a practical point of view, however, it is still relatively difficult to estimate the exact gap size from next-generation sequencing paired-read libraries. In particular, mate-pair libraries can show a relatively broad insert size distribution, which makes it hard to define the correct distance between the pairs. Paired-end background noise is also commonly observed in mate-pair data and leads to incorrect scaffolds. Consequently, a correct closure may be rejected because of an erroneous gap size estimate after scaffolding. Future research will hopefully lead to better insight into these issues.
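Two of the ideas discussed above, extending a contig edge through k-mer overlap and rejecting closures whose length disagrees with the estimated gap size, can be illustrated with a minimal Python sketch. This is not GapFiller's actual implementation: the function names, the `min_support` and `tolerance` parameters, the majority-vote rule and the toy reads are all assumptions made for illustration only.

```python
from collections import Counter


def extend_right(contig, reads, k, min_support=2, max_steps=1000):
    """Greedily extend `contig` to the right, one base at a time.

    Illustrative only (hypothetical helper, not GapFiller's code).
    At each step the last k bases of the growing sequence are used as a
    seed; every read containing the seed votes for the base that follows it.
    """
    seq = contig
    for _ in range(max_steps):  # hard cap guards against looping in repeats
        seed = seq[-k:]
        votes = Counter()
        for read in reads:
            pos = read.find(seed)
            while pos != -1:
                nxt = pos + k
                if nxt < len(read):
                    votes[read[nxt]] += 1
                pos = read.find(seed, pos + 1)
        if not votes:
            break  # no read overlaps the current edge
        ranked = votes.most_common(2)
        best_base, best_count = ranked[0]
        if best_count < min_support:
            break  # too little evidence to extend
        if len(ranked) > 1 and ranked[1][1] == best_count:
            break  # ambiguous next base, stop conservatively
        seq += best_base
    return seq[len(contig):]  # only the newly added bases


def closure_is_acceptable(closed_length, estimated_gap, tolerance):
    """Accept a closure only if its length is close to the estimated gap size.

    `tolerance` is an assumed, user-chosen bound in bases.
    """
    return abs(closed_length - estimated_gap) <= tolerance


if __name__ == "__main__":
    # Toy overlapping reads (made up for this example).
    toy_reads = ["TTGCAAGG", "TGCAAGGC", "GCAAGGCT",
                 "CAAGGCTT", "AAGGCTTA", "AGGCTTAC"]
    added = extend_right("ACGTTGCA", toy_reads, k=4)
    print(added, closure_is_acceptable(len(added), estimated_gap=8, tolerance=2))
```

In this sketch, a stricter `min_support` or a smaller `tolerance` makes closure more conservative, which mirrors the trade-off described above: fewer dubious closures, at the cost of rejecting some correct closures when the gap size estimate from the scaffold is itself inaccurate.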
The IMAGE method, by contrast, has several limitations in terms of quality, speed and ease of use (the method can handle only one library per analysis). Notably, the strategy produces relatively large genomes, most likely as a consequence of sequence extension at the left and right scaffold edges.
The findings in this paper have been derived using different short-read Illumina libraries. The length of the gapped regions that can be closed is therefore limited to the insert distance of the paired reads. Nonetheless, the promising outcomes strongly indicate that long-range gaps can also be effectively closed with high-quality 454 and/or single-molecule paired sequence reads. We have also put emphasis on creating a user-friendly and fast tool so that it can be widely used by the community. We feel GapFiller can make a significant contribution to (almost) closing genome assemblies in a reliable manner.