A major challenge facing us in this post-genomic era is how to extract maximum information from completed genome sequence assemblies (1), so as to address basic questions in gene annotation, expression profiling, gene regulation and genome variation.
The sequencing approach has clear advantages over microarrays because it elucidates the exact nucleotide content of target DNA sequences. However, a major constraint has been its higher cost and lower data-generation speed relative to microarrays. As an improvement on methods involving one template per read, serial analysis of gene expression (SAGE) was developed (2). This strategy uses short DNA tags, each representing an entire DNA fragment, and the concatenation of these tags for efficient sequencing enables the characterization of whole transcriptomes and genomes. However, the mapping of short single tags to the genome often results in positional ambiguities. This drawback was partially addressed in recent modifications that specifically extracted 5′ terminal signatures of cDNA (4), but it was the simultaneous tagging of both 5′ and 3′ terminal signatures that provided an ideal solution. To achieve this, we initially developed an intermediate approach that separately extracted 5′ and 3′ terminal tags from cDNA fragments for sequencing (6). Subsequently, we developed gene identification signature (GIS) analysis, in which the 5′ and 3′ signatures of each full-length transcript were simultaneously extracted, then covalently linked into paired-end ditag (PET) structures for concatenated high-throughput sequencing and the accurate demarcation of transcriptional unit boundaries in assembled genome sequences (7). An average capillary sequencing read (700–800 bp) of a single GIS-PET library clone would reveal 10–15 PET units, thus representing a 20- to 30-fold increase in annotation efficiency compared to the bidirectional sequencing analysis of full-length cDNA (flcDNA) clones.
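The 20- to 30-fold figure follows from simple per-read accounting, sketched below. The sketch assumes that bidirectional flcDNA analysis spends two reads (one from each end) on one clone, and that each PET unit represents one transcript; the function name `transcripts_per_read` is illustrative only.

```python
# Transcripts characterized per sequencing read, using the figures quoted above.

def transcripts_per_read(transcripts: int, reads: int) -> float:
    """Number of transcripts annotated per sequencing read."""
    return transcripts / reads

# Assumption: bidirectional flcDNA analysis uses two reads per clone
# (one from each end) to characterize a single transcript.
flcdna = transcripts_per_read(1, 2)

# One GIS-PET read reveals 10-15 PET units; each PET is assumed to
# represent one transcript.
pet_low = transcripts_per_read(10, 1)
pet_high = transcripts_per_read(15, 1)

fold_low = pet_low / flcdna    # 20.0
fold_high = pet_high / flcdna  # 30.0
print(f"{fold_low:.0f}- to {fold_high:.0f}-fold increase per read")
```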
We have also successfully applied this PET-based DNA analysis strategy to characterize genomic DNA fragments enriched for specific target sites by chromatin immunoprecipitation (ChIP), and these chromatin immunoprecipitation-PET (ChIP-PET) analyses have provided a global overview of p53 transcription factor binding sites in the human genome (6), as well as Oct4 targets in the mouse genome (8).
The PET concept can conceivably be applied to other DNA sequence analyses that will benefit from paired-end characterization, including the study of epigenetic elements and genome scaffolding. One point to note is that while the number of sequencing reads (~50 000) required for a comprehensive GIS-PET or ChIP-PET analysis is minuscule for most genome centers with state-of-the-art Sanger capillary sequencers, and within the reach of core facilities in university laboratories, the final cost of each PET experiment can be significant. Hence, we are continually seeking ways to improve the efficiency and cost-effectiveness of PET analysis.
Recently, a novel, highly parallel multiplex sequencing-by-synthesis method based on pyrosequencing in picolitre-scale reactions (454-sequencing™) was reported, in which ~300 000 DNA templates were simultaneously sequenced in a single 4 h machine run to a read length of ~100 bases, with an accuracy of 99.6% (9). Although this multiplex sequencing approach, as described, potentially yields a remarkable 100-fold increase in throughput compared with current Sanger capillary sequencing technology, its obvious weaknesses are the short read length, which limits wider application to many genome sequencing projects, and its inability to obtain paired-end information.
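Taking the run figures quoted above at face value, the raw output of one run can be estimated with a few lines of arithmetic. This is a rough sketch only: it assumes every template yields a full ~100-base read and that the 99.6% accuracy is a per-base figure, neither of which is stated explicitly here.

```python
# Back-of-envelope throughput of one multiplex sequencing run,
# using only the approximate figures quoted in the text.
templates_per_run = 300_000  # ~300 000 templates per run
read_length = 100            # ~100 bases per read
run_hours = 4                # one 4 h machine run
accuracy = 0.996             # assumed to be per-base accuracy

bases_per_run = templates_per_run * read_length          # 30 000 000 bases
bases_per_hour = bases_per_run / run_hours               # 7 500 000 bases/h
bases_per_second = bases_per_run / (run_hours * 3600)    # about 2083 bases/s

# Expected miscalled bases per read under the per-base assumption.
expected_errors_per_read = read_length * (1 - accuracy)  # about 0.4

print(f"{bases_per_run / 1e6:.0f} Mb per run, "
      f"{bases_per_second:.0f} bases/s, "
      f"{expected_errors_per_read:.1f} expected errors per read")
```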
Another recent advance is the Polony sequencing technology (10), whose chief advantages are low sequencing cost and the ability to produce paired-end reads of DNA fragments at a raw data acquisition rate reportedly an order of magnitude faster than conventional Sanger sequencing. In its current manifestation, however, the technology suffers from a lower-than-predicted throughput (~140 bp/s) and raw base-calling accuracies poorer than in Sanger sequencing. In addition, an unusual sequencing-by-ligation scheme results in short, discontiguous paired-end tags (each of 13 bases interrupted by an indeterminate gap of 4 to 5 bases) that are insufficient for specific mapping in complex genomes, thus precluding the Polony method from applications involving mammalian genome sequencing.
It was apparent to us that a melding of technologies would be highly beneficial. The massively parallel, short-read nature of the new 454-sequencing method lends itself well to enhanced PET analysis: each ~40 bp PET would compensate for the inherent disadvantages of short reads by providing paired-end information from long contiguous DNA fragments. Mapping of these PETs to assembled genomes would allow the original sequence to be inferred. Furthermore, by a simple modification of the original GIS-PET procedure, we were able to dimerize PETs prior to multiplex sequencing, thereby further increasing data-gathering efficiency. Finally, the very high sequencing throughput of 454-sequencing should allow the global analysis of transcriptomes and genomes at an unprecedented speed.
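The idea that a PET, once mapped, lets the original fragment be inferred from the assembled genome can be illustrated with a toy sketch. It is not the mapping procedure used in this work: it assumes exact matches, forward-strand mapping only, and a PET split into two equal-length halves, and the names `find_all`, `map_pet` and `max_span` are hypothetical.

```python
from typing import List, Optional, Tuple

def find_all(genome: str, tag: str) -> List[int]:
    """Return every 0-based start position of tag in genome (overlaps allowed)."""
    hits = []
    start = genome.find(tag)
    while start != -1:
        hits.append(start)
        start = genome.find(tag, start + 1)
    return hits

def map_pet(genome: str, pet: str, max_span: int) -> Optional[Tuple[int, int]]:
    """Map a paired-end ditag to the forward strand of genome.

    Returns (start, end) of the inferred fragment as a half-open interval,
    or None unless exactly one 5'/3' pairing fits within max_span.
    """
    half = len(pet) // 2
    tag5, tag3 = pet[:half], pet[half:]  # assumption: equal-length halves
    candidates = []
    for s in find_all(genome, tag5):
        for e in find_all(genome, tag3):
            end = e + len(tag3)
            # The 3' tag must lie downstream of the 5' tag, within max_span.
            if e >= s + len(tag5) and end - s <= max_span:
                candidates.append((s, end))
    return candidates[0] if len(candidates) == 1 else None

# Toy genome: a fragment flanked by filler sequence (6-base tags for brevity).
genome = "TTTTTTTT" + "ACGTAC" + "GGGGGGGGGG" + "CATGCA" + "TTTTTTTT"
pet = "ACGTAC" + "CATGCA"

span = map_pet(genome, pet, max_span=100)
print(span, genome[span[0]:span[1]])

# A second copy of the 5' tag makes a single tag ambiguous,
# but the pair still resolves to one location.
genome2 = genome + "ACGTAC" + "TTTT"
print(len(find_all(genome2, "ACGTAC")), map_pet(genome2, pet, max_span=100))
```

In the second toy genome the 5′ tag alone matches two positions, whereas the paired constraint leaves a single candidate, which is the positional ambiguity that paired-end information is meant to resolve. A realistic mapper would also need to handle both strands, sequencing errors and repeats.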
In this report, we describe the utility, efficiency and accuracy of applying this multiplex sequencing of paired-end ditags (MS-PET) analysis to characterize both the human transcriptome and genome.