Several methods for ultra high-throughput DNA sequencing are currently under investigation (1). Many of these methods yield very short blocks of sequence information (reads), with some proposed methods giving reads as short as 16 nt. To be useful, these reads must be long enough to be placed unambiguously on a known template sequence or, if no template is available, to generate unambiguous overlaps from which the sequence can be reconstructed. In both of these processes, repeats cause significant problems. Here we report an analysis showing the level of genome sequencing possible as a function of read length. We show that re-sequencing and de novo sequencing of the majority of a bacterial genome is possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs of 1000 nt and greater that cover 80% of human chromosome 1.
The leading methods for ultra high-throughput DNA sequencing fall into two main categories: sequencing by hybridization (6) and sequencing by synthesis (5). Sequencing by hybridization is an extension of well-established DNA microarray techniques that essentially aim to identify all subsequences of a specific length within a genome via their hybridization to presynthesized probes. Sequencing by synthesis uses a process in which nucleotides are added in a controlled fashion to isolated DNA templates, each base being added and then read in turn in a cyclic process. Both of these approaches produce sequence information in very short segments compared with conventional Sanger sequencing. For sequencing by hybridization, the length of each fragment of sequence information (the ‘read length’) is limited by the length of the probe. Probes cannot be extended much beyond 30 nt, as the selectivity for perfect matches over single mismatches drops to unacceptable levels (12). In the case of sequencing by synthesis, obtainable read length is proportional to cycle efficiency. However, the technical challenges involved in this approach mean that the length of useful sequence that can be obtained is limited. Solexa Ltd have recently claimed reads of 25 nt with sufficient throughput to re-sequence the viral genome of
). Another promising technique is the high-throughput pyrosequencing approach of 454 Life Sciences, who are currently reporting read lengths of ~100 nt (13).
A key problem for these sequencing methodologies is that as the length of each individual read decreases, the probability that a read will occur more than once in the sequence increases. The problems that repetitions can cause for sequencing projects based on whole genome shotgun approaches, where the read length is as short as 500 nt, have recently been analysed in detail (14). There has been much debate concerning the minimum length of read required to generate useful sequence information (3). However, despite the importance of this analysis for the utility of many proposed ultra high-throughput sequencing methods, little work has been reported on the analysis or reassembly of sequence information from very short reads. Perhaps the most significant contribution to this area has been that of Chaisson et al
) who discuss the limitations of short read sequencing with read lengths starting at 70 nt; the largest genome analysed (Neisseria meningitidis) is ~2 Mb. In contrast, here we describe an analysis for read lengths between 18 and 200 nt, and extend our analysis to the whole human genome. Also, rather than showing the result of a particular reassembly tool, our analysis describes the absolute limits of sequence data that can be reassembled.
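The dependence of read uniqueness on read length can be illustrated with a small sketch. It is not the analysis reported here: it assumes error-free reads taken from every start position on a single strand, and the function name `fraction_unique_reads` and the toy sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def fraction_unique_reads(sequence: str, k: int) -> float:
    """Return the fraction of read start positions whose k-nt read
    occurs exactly once in `sequence` (single strand, no errors)."""
    reads = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not reads:
        raise ValueError("read length k exceeds sequence length")
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)


# Hypothetical toy sequence: a 12-nt repeat embedded twice in other sequence.
repeat = "ACGTTGCAAGCT"
toy = "GGATCCTA" + repeat + "TTAGGCAT" + repeat + "CCGATAGA"

for k in (2, 4, 8, 12, 16):
    print(k, round(fraction_unique_reads(toy, k), 3))
```

In this toy case, reads no longer than the repeat cannot all be unique, because the repeat itself is read twice. On a real genome the same count would be run over far longer sequences and both strands, which this sketch does not attempt.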
In a sequencing project based on short reads, repeated sections of the genome will cause several types of problem. If re-sequencing uses a known template sequence, repetitions will prevent the unambiguous assignment of reads to a single position on the template (16). The uniqueness of reads within the genome will therefore be a key factor in determining how successful such re-sequencing can be. If the aim is de novo sequencing, however, a different problem is encountered: repetitions will cause significant problems for the assembly of contigs and will severely limit the amount of the sequence that can be effectively reconstructed.
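The two problems can be contrasted with a minimal sketch under strong simplifying assumptions: error-free reads from every position on one strand, a base counted as re-sequenced if at least one unique read covers it, and a contig broken wherever a read is not unique. These rules, and the names `unique_read_mask`, `resequencing_coverage` and `contig_lengths`, are illustrative assumptions and are not the procedure used in this work.

```python
from collections import Counter


def unique_read_mask(sequence: str, k: int) -> list:
    """For each start position, True if the k-nt read there occurs once."""
    reads = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(reads)
    return [counts[read] == 1 for read in reads]


def resequencing_coverage(sequence: str, k: int) -> float:
    """Fraction of bases covered by at least one uniquely placeable read."""
    covered = [False] * len(sequence)
    for start, is_unique in enumerate(unique_read_mask(sequence, k)):
        if is_unique:
            for pos in range(start, start + k):
                covered[pos] = True
    return sum(covered) / len(sequence)


def contig_lengths(sequence: str, k: int) -> list:
    """Lengths (nt) of maximal runs of consecutive unique reads.

    A run of n consecutive unique k-nt reads spans n + k - 1 nt.
    Treating every non-unique read as a contig break is an assumption.
    """
    lengths = []
    run = 0
    for is_unique in unique_read_mask(sequence, k):
        if is_unique:
            run += 1
        else:
            if run:
                lengths.append(run + k - 1)
            run = 0
    if run:
        lengths.append(run + k - 1)
    return lengths


# Hypothetical toy sequence containing one 12-nt repeat.
repeat = "ACGTTGCAAGCT"
toy = "GGATCCTA" + repeat + "TTAGGCAT" + repeat + "CCGATAGA"

for k in (8, 12, 16):
    print(k, round(resequencing_coverage(toy, k), 3), contig_lengths(toy, k))
```

Under these assumptions a repeat reduces re-sequencing coverage only where no unique read overlaps a base, whereas for de novo assembly it splits the reconstruction into separate contigs.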