Tag-based sequencing strategies such as Serial Analysis of Gene Expression (SAGE) are efficient for analyzing DNA fragments in transcriptome characterization and genome annotation studies [1-3]. However, each SAGE tag is anchored to a restriction enzyme recognition site within the DNA segment, so its information content is limited, and the mapping of SAGE tags to genome sequences for transcript identification can be ambiguous. Recent improvements in tagging the 5' terminal signatures of cDNA [4,5] have helped to determine transcription start sites (TSS), but the most significant advance in this field is the simultaneous tagging of the 5' and 3' terminal signatures of the DNA fragments under study. In this effort, we first developed an intermediate approach that precisely extracts separate 5' and 3' terminal tags from cDNA fragments for sequencing [6]. With this capability, we then designed and developed a cloning strategy, called Gene Identification Signature (GIS) analysis, which covalently links the 5' and 3' signatures of each full-length transcript into a Paired-End diTag (PET) structure [7]. In a GIS-PET experiment, most of the PETs are 36 bp in length (18 bp for the 5' signature tag and 18 bp for the 3' signature tag), and multiple PETs can be concatenated to form longer stretches of DNA for efficient high-throughput sequencing. An average sequencing read (700–800 bp) of a GIS-PET library clone can reveal 10–15 PET units, which at the upper end is equivalent to 30 conventional cDNA sequencing reads (15 cDNA clones sequenced from both ends). The PET sequences can then be accurately mapped to the reference genome sequences and precisely demarcate the boundaries of transcription units in the genome landscape. With this combined efficiency and accuracy, GIS-PET allows a mammalian transcriptome to be thoroughly analyzed using hundreds of thousands of high-quality transcript sequences from a modest sequencing effort, as further demonstrated in the comprehensive characterization of mouse transcriptomes [8]. The PET-based DNA analysis strategy has also been applied to characterize genomic DNA fragments generated by chromatin immunoprecipitation (ChIP), which are enriched for the binding targets of given DNA-binding proteins. Whole-genome ChIP-PET data have provided global maps of transcription factor binding sites for p53 in the human genome [9] and for Oct4 and Nanog in the mouse genome [10]. PET-based DNA analyses (GIS-PET and ChIP-PET) promise to play a significant role in the post-genome efforts to identify all functional elements in the human genome [11], and there is no inherent limit to applying the PET-based approach to other DNA analyses, such as analyses of epigenetic elements.
To fully realize the potential of PET-based sequencing analyses, we have to develop sophisticated informatics capabilities to manage the large volume of PET sequences generated from each GIS-PET and ChIP-PET experiment. This raises a battery of new bioinformatics challenges: how to accurately identify and extract the PET sequences embedded in raw sequence reads; how to specifically and efficiently map the paired 5' and 3' signatures of PET sequences to complex genomes such as the human and mouse genomes; and how to manage the immense amount of data generated from GIS-PET and ChIP-PET experiments in a user-friendly way that supports effective data mining and analysis. Because of the paired-end nature of PET sequences, these issues are far more complicated than those related to SAGE-like mono-tags and therefore cannot be handled by the software packages previously developed for SAGE analysis [12-15].
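To illustrate why paired signatures are harder to place than mono-tags, the following is a minimal sketch, not the PET-Tool implementation. It assumes exact matching on a single in-memory sequence and accepts a pair only when both tags lie on the same strand, in the expected order, and within a maximum span. The function names and the `max_span` threshold are hypothetical choices made for this example.

```python
# Illustrative sketch only: exact-match placement of a paired-end ditag (PET).
# The pairing rules (same strand, correct order, bounded span) are assumptions
# made for this example, not a description of PET-Tool's Mapper.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(genome, tag):
    """Return every 0-based start position of tag in genome."""
    hits = []
    pos = genome.find(tag)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(tag, pos + 1)
    return hits


def map_pet(genome, tag5, tag3, max_span=1_000_000):
    """Return (start, end, strand) for each consistent placement of a PET."""
    placements = []
    # Plus strand: the 5' tag must lie upstream of the 3' tag.
    for s5 in find_all(genome, tag5):
        for s3 in find_all(genome, tag3):
            end = s3 + len(tag3)
            if s5 < s3 and end - s5 <= max_span:
                placements.append((s5, end, "+"))
    # Minus strand: both tags appear reverse-complemented and in flipped order.
    rc5, rc3 = reverse_complement(tag5), reverse_complement(tag3)
    for s3 in find_all(genome, rc3):
        for s5 in find_all(genome, rc5):
            end = s5 + len(rc5)
            if s3 < s5 and end - s3 <= max_span:
                placements.append((s3, end, "-"))
    return placements
```

A mono-tag needs only the `find_all` step. A PET additionally requires the cross-check between the two hit lists, which is where the ambiguity of each individual tag can be resolved or, if no consistent pair exists, where the PET is left unmapped.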
To accommodate and process PET sequence data, we developed a software suite called PET-Tool that provides a complete solution, from extracting PET sequences from raw sequencing reads to mapping the PET sequences to the reference genomes. In this study, we describe the architecture, implementation details, utility, and robustness of PET-Tool, which we evaluated by analyzing four datasets generated from two GIS-PET libraries and two ChIP-PET libraries.
The architecture of PET-Tool
PET-Tool is designed to provide complete solutions for processing and managing the PET data generated from GIS-PET and ChIP-PET experiments. In these experiments, either full-length cDNA or genomic DNA enriched by ChIP is converted into PET structures that are further concatenated and cloned into a plasmid vector for sequencing analysis [7,9]. The core functions of the Tool are to extract PET sequences from raw DNA sequence reads and to map the PET sequences to the genome sequences. In addition, the Tool should manage the large volume of PET data generated from each PET experiment and provide user-friendly analytic functions to evaluate the quality of each PET dataset. PET-Tool comprises four modules: Extractor, Examiner, Mapper, and ProjectManager (Figure ). Extractor uploads raw sequence files and de-convolutes the PET sequence units embedded in each raw sequence read to generate PET sequences, which are stored in a relational database. Examiner allows users to examine and validate the PET extraction results. It provides the basic statistics of PETs in each project, library, and plate of sequences. It also presents a graphic dissection of each input sequence read, highlighting the sequence sections with color codes that help users to distinguish vector flanking regions, spacer sequences between PET units, and the PET sequences themselves. This allows users to identify potential irregularities in the sequence and to adjust the extraction parameters. The Mapper module maps the high-quality PET sequences to the corresponding genome sequences. For efficient mapping of large volumes of PET sequences, we used a newly developed alignment approach based on a compressed suffix array (CSA), in which the entire genome sequence assembly is indexed as a reference database and the 5' and 3' signatures of a PET sequence are matched to the genomic index [unpublished results]. The ProjectManager module organizes the data and analysis results hierarchically, so that multiple projects can be managed at the levels of organisms, libraries, raw DNA sequences (in plate and single-well format), PETs, and the attributes of each PET.
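The de-convolution step performed by Extractor can be sketched as follows. This is a simplified illustration under stated assumptions, not the Extractor code: the vector flanking sequences and the spacer are hypothetical parameters supplied by the caller, PET units are assumed to be separated by an exact copy of the spacer, and the accepted length window around the typical 36 bp is an arbitrary choice for the example.

```python
# Illustrative sketch only: splitting one raw read into PET units.
# Flank and spacer sequences are hypothetical inputs; the length window and
# the rule of taking the first and last 18 bases as the 5' and 3' tags are
# assumptions made for this example.
import re


def extract_pets(read, left_flank, right_flank, spacer,
                 min_len=34, max_len=40, tag_len=18):
    """Return (pets, rejected) for one raw read.

    pets is a list of (tag5, tag3) tuples; rejected holds the units that
    failed the length or base-composition check.
    """
    start = read.find(left_flank)
    end = read.rfind(right_flank)
    # Without both vector flanks in the right order the insert cannot be located.
    if start == -1 or end == -1 or end < start + len(left_flank):
        return [], []
    insert = read[start + len(left_flank):end]

    pets, rejected = [], []
    for unit in insert.split(spacer):
        if not unit:
            continue  # two adjacent spacers leave an empty unit
        if min_len <= len(unit) <= max_len and re.fullmatch("[ACGT]+", unit):
            pets.append((unit[:tag_len], unit[-tag_len:]))
        else:
            rejected.append(unit)
    return pets, rejected
```

Keeping the rejected units, rather than silently dropping them, mirrors the role described for Examiner: irregular units are the ones a user would want to inspect before adjusting the extraction parameters.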
Implementation
PET-Tool is implemented for both UNIX and Linux. The web-based user interface is implemented in Perl/CGI and hosted by an Apache web server. The interface can be accessed with any web browser that supports current web standards.
Data storage combines a flat file system with a MySQL-based Relational Database Management System (RDBMS). The MySQL database is used for efficient and fast PET data storage, tracking, and retrieval, and for interfacing with the back-end programs through the Perl DBI module. MySQL also hosts the various statistical data and mapping results. Flat files are used to store the uploaded sequence data, with the positional indices of all sequences stored in the MySQL database for quick sequence retrieval. Back-end programs are implemented in Perl and C. The compressed suffix array (CSA) programs are implemented in C for efficiency and robust handling of the advanced data structures. Programs for PET sequence extraction, statistical computation, data retrieval and storage, web interaction, and other non-intensive tasks are implemented in Perl. The minimum hardware requirements are a 500 MHz Pentium III processor, 256 MB of RAM, and a 20 GB hard disk drive. A regular 500 MHz machine would take about two days to process a library of one million PETs; a computer equipped with a 2.4 GHz processor could complete the same job in a few hours.
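The combination of flat files with positional indices held in a database can be illustrated with a short sketch. It is not PET-Tool code: Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained, and the table name, column names, and sequence identifiers are hypothetical.

```python
# Illustrative sketch only: sequences live in a flat file, while a database
# table records the byte offset and length of each one for direct retrieval.
# sqlite3 is a stand-in for MySQL; all table and column names are hypothetical.
import sqlite3


def build_index(flat_path, records, conn):
    """Write (seq_id, sequence) records to a flat file and index their offsets."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_index ("
        "seq_id TEXT PRIMARY KEY, byte_offset INTEGER, byte_length INTEGER)"
    )
    with open(flat_path, "wb") as fh:
        for seq_id, seq in records:
            data = seq.encode("ascii")
            # Record where this sequence starts before writing it.
            conn.execute(
                "INSERT INTO seq_index VALUES (?, ?, ?)",
                (seq_id, fh.tell(), len(data)),
            )
            fh.write(data + b"\n")
    conn.commit()


def fetch(flat_path, seq_id, conn):
    """Return one sequence by seeking directly to its stored offset."""
    row = conn.execute(
        "SELECT byte_offset, byte_length FROM seq_index WHERE seq_id = ?",
        (seq_id,),
    ).fetchone()
    if row is None:
        return None
    with open(flat_path, "rb") as fh:
        fh.seek(row[0])
        return fh.read(row[1]).decode("ascii")
```

With this layout the database stays small, because it holds only identifiers and offsets, while a single read can be retrieved without scanning the whole flat file.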