|Home | About | Journals | Submit | Contact Us | Français|
We present PEACE, a stand-alone tool for high-throughput ab initio clustering of transcript fragment sequences produced by Next Generation or Sanger Sequencing technologies. It is freely available from www.peace-tools.org. Installed and managed through a downloadable user-friendly graphical user interface (GUI), PEACE can process large data sets of transcript fragments of length 50 bases or greater, grouping the fragments by gene associations with a sensitivity comparable to leading clustering tools. Once clustered, the user can employ the GUI's analysis functions, facilitating the easy collection of statistics and allowing them to single out specific clusters for more comprehensive study or assembly. Using a novel minimum spanning tree-based clustering method, PEACE is the equal of leading tools in the literature, with an interface making it accessible to any user. It produces results of quality virtually identical to those of the WCD tool when applied to Sanger sequences, significantly improved results over WCD and TGICL when applied to the products of Next Generation Sequencing Technology and significantly improved results over Cap3 in both cases. In short, PEACE provides an intuitive GUI and a feature-rich, parallel clustering engine that proves to be a valuable addition to the leading cDNA clustering tools.
Understanding an organism's transcriptome, the set of (spliced) transcripts expressed by genes of the organism, is a vital step in understanding the full functional and organizational role of the genome in the life cycle of any eukaryote. Studying the transcriptome has led to gene discovery, provided information on splice variants and helped shed light on the biological processes both controlling and controlled by the genome (1). However, to access those transcripts, we must deal with the fragmented data produced by both Next Generation and traditional Sanger sequencing technology.
In the past, access to a transcriptome sequence was primarily through the use of expressed sequence tags (ESTs), single-pass cDNA sequences derived from transcribed mRNAs and sequenced by Sanger sequencing technology. More recently, Next Generation Sequencing (NGS) technology has begun to rapidly replace Sanger equencing. For example, ESTs now being added to the GenBank dbEST are increasingly the product of NGS technologies such as 454 pyrosequencing, which enables the sequencing of novel and rare transcripts at a considerably higher speed (2,3). From a computational perspective, this is a mixed blessing: while NGS provides immense quantities of new information, it also provides immensely larger data sets—and thus a need for fast, efficient analysis algorithms.
Given a set of transcript fragments sampled from across the genome, a necessary first step of the set's analysis is clustering: separating the fragments according to the transcript from which they were derived. Frequently performed implicitly by assembly tools, clustering the data as a ‘pre-assembly’ step has a number of advantages. Most significantly, performing this step will allow the application of the assembly tool to individual clusters—saving significant amounts of time (4).
Clustering is a computationally challenging problem; the run time and memory requirements to cluster on the basis of pairwise sequence alignments make such an approach infeasible in practice. To deal with this, PEACE combines our own version of the d2 alignment-free sequence distance function (5) and the concept of a minimum spanning tree (MST) (6) to quickly and accurately find clusters of ESTs expressed from the same gene without reference to a sequenced genome. Compared against clustering tools in the literature (4,7–15), PEACE produces results of quality competitive with the WCD and TGICL tools (4,15), and more sensitive and robust than other tools. From the user perspective, no tool in the literature can match PEACE for ease of installation or use, or the post-clustering analysis tools PEACE provides.
In short, PEACE is a computational tool for the ab initio clustering of transcript fragments by gene association, applicable to both NGS and traditional Sanger sequencing technologies. Available through the www.peace-tools.org website, the PEACE GUI allows the user to both easily install (locally or remotely) and run the clustering engine, as well as enabling transparent parallel processing and providing various tools for result analysis.
The PEACE GUI, available as a JAR file from the PEACE website (www.peace-tools.org), can be run on any machine supporting the standard Java Virtual Machine (JVM). The user can employ the GUI to install the clustering engine and perform a clustering of a data file in FASTA format, view an analysis of the clusters and produce files containing subsets of the clusters as input to assembly tools such as Cap3 (9). A typical (first) use of PEACE must be performed in the following manner (Figure 1).
To install the PEACE clustering engine onto a local or remote machine, the user selects from within the GUI the appropriate menu tab (Figure 2a), which then starts an install wizard that will prompt for the appropriate information. Figure 2b illustrates the request for server information; the user has chosen to install the PEACE computational tool on a remote machine and is providing the necessary connection information. Server information is persistent between GUI sessions, giving the user access to PEACE on the target machine as needed.
After importing the target sequence file into the GUI, the user starts a new job by following the wizard menus. Figure 2c illustrates the process of specifying the number of processors available (if running on a machine supporting the MPI protocol—determined during job installation). Once executed, the GUI will manage the job thread, alert the user when the job is completed (or when the user next runs the GUI after completion) and copy the final results back to the local machine if necessary.
Once the resulting clusters have been computed, the user has several options for analysis:
Extensive documentation for the tool has been posted on the PEACE website, as well as links to several tutorial videos demonstrating PEACE use and capabilities.
The clustering performed by PEACE is based on the use of MSTs, known to be an effective approach for narrow band single linkage clustering (16,17). Using a graph structure to model the fragment relationships and the d2 distance measure to assign edge weights (5), we can employ Prim's algorithm (6) to efficiently calculate an MST from which we can infer a high-quality clustering solution.
The d2 distance measure used to assign edge weights is an alignment-free measurement of sequence distance that can be calculated significantly faster than a Smith–Waterman alignment (5). The d2 algorithm works by comparing the frequency of words (strings of a fixed length) appearing in a limited region of each string. Fragments overlapping by a sufficient length will share neighborhoods of enough similarity to ensure a small distance even in the presence of a moderate number of base errors. In practice we employ our own variation of d2, the ‘two-pass d2 algorithm’, which heuristically searches for a neighborhood of maximum similarity and then finds the d2 score based on that neighborhood (see Supplementary Data for details).
Fragment input is modeled as a weighted, undirected graph, the fragments are represented as nodes, with d2 sequence distances assigned to the connecting edges as weights. Conceptually, we want to remove each edge exceeding a threshold score from the complete graph and define our partitions by the remaining connected components. An edge with a large weight connects fragments that are likely unrelated; once such edges are removed, the components define a series of overlaps. Those fragments that can still be connected by some path correspond to the same gene. However, such an approach requires the calculation of all edge weights. That task is infeasible both in terms of run time and memory usage for the data set sizes we expect to process.
PEACE approaches the problem by generating an MST of the described graph, then removing edges exceeding our threshold. By using Prim's algorithm, we are able to calculate edge weights on-the-fly (reducing memory requirements) and can skip the calculation of a majority of edge distances using the u/v and t/v filtering heuristics employed in WCD (4). These heuristics allow us to quickly dismiss many of the edges as too large without the need to apply the full d2 algorithm (see Sections A.3 of the Supplementary Data for more details).
PEACE has been tested on both simulated and real data from NGS and Sanger sequencing technologies, comparing results against those produced by the WCD clustering tool (4) and the Cap3 assembly tool (9) (the latter of which implicitly calculates a clustering in the process of assembly). For our simulation tests, we used the ESTSim tool to generate simulated Sanger-sequenced transcript fragments and the MetaSim tool to generate simulated short-read sequences from 454 and Illumina technologies (18,19). (See Supplementary Data, Section D.1, for details on the sequence size and error models.) Fragments were generated from the list of 100 zebrafish genes used in the WCD testing (4). Tool parameters were taken to match, as closely as possible, those used in the WCD study (see Supplementary Data).
The most important method of quality assessment is ‘sensitivity’ (the fraction of fragment pairs from the same gene that were correctly clustered together). We also look at the ‘Jaccard Index’ (which balances sensitivity with the number of false positives), ‘Type 1 error’ (the fraction of genes that were divided between clusters), and ‘Type 2 error’ (the fraction of clusters containing two or more genes) (4,20). In Figure 3, we plot these four tests as a function of error rate and observe the almost identical results between PEACE and WCD. In Figure 4, we plot the run time for PEACE and WCD, again observing almost identical results when run sequentially—but significantly faster run time for PEACE on multiple processors when holding the EST/processor ratio constant (ranging from a 65% improvement for two processors to a 17% improvement for 12 processors).
In the course of our simulations, we also investigated the memory footprint—of particular concern in the large data sets commonly encountered. We find that, as with WCD, the memory used for execution is linear in the size of the data set, requiring slightly more memory than WCD and considerably less than Cap3. (See Section D.4 and Supplementary Figure S3 of the Supplementary Data.)
We note, as a point of interest, that we can significantly improve the quality of PEACE simulation results through the increase of the threshold value (see Supplementary Data, Section D.5)—achieving a significant improvement in PEACE sensitivity without an adverse effect on the Jaccard Index. However, the improvement does not carry over to the application of real data (Supplementary Data, Section E.2), where we observe a significant increase in the incorrect merging of clusters. While this might be acceptable to a user planning to employ an assembly tool capable of breaking up the clusters (but unable to remerge a split cluster), for the purposes of clustering assessment it makes sense to base our analysis (and set our default values) at the lower threshold.
Comparisons against TGICL (15) are more complicated. We find that while TGICL is more sensitive at lower error rates, PEACE is more robust to error, and in all cases PEACE has an improved Type 2 error rate and is better able to distinguish duplicated genes. (See Supplementary Data, Section D.6.)
In applying the tools to real Sanger data, we used the Human Benchmark data set used to test EasyCluster (14) and the A076941 Arabidopsis thaliana data set used to test WCD (4) (Table 1). We notice mixed results, with PEACE, WCD and TGICL showing comparable similarity, PEACE and WCD showing some superiority to TGICL in Type 1 error rate, but with a poorer showing in Type 2 error rate, and with all three tools coming considerably closer to the correct number of clusters than does Cap3. In run time, we see some inconsistency, with PEACE showing a 60% run time improvement over WCD in the first data set and a time comparable to TGICL, but requiring 20% more time than WCD in the Arabidopsis, data set and losing to TGICL by a large margin. In Supplementary Table S1, we present run times of several more sets for the distance-based tools, observing that while PEACE appears to be significantly faster on the smaller sets, WCD does overtake it for larger sets.
We tested the tools on short-read data using the MetaSim tool of Richter et al. (19). Encoded into MetaSim are sequence generation and error models corresponding to several technologies, including the 454 short-read technology (producing reads ranging in size from 200 bp to 370 bp with a mean of 250 bp and a standard deviation of 17 bp) and the Illumina short-read technology (producing reads of exactly 62 bp). In the 454 experiments, PEACE achieved a sensitivity of 0.871, an 89.7% improvement in sensitivity over WCD and a 326% improvement in sensitivity over TGICL, with comparable values for the Jaccard Index. WCD was unable to handle the Illumina data, while PEACE sensitivity, at a value of 0.384, was three times as good as TGICL (for more details, see Supplementary Table S3). In short, PEACE performs quite well for 454 reads and provides some useful information for Illumina reads, with considerably better results than WCD or TGICL.
We have presented PEACE, a stand-alone tool for the high-throughput clustering of transcript fragments capable of dealing with sequences as short as 50 bases. PEACE is open source and managed through a user-friendly GUI that enables both local and remote installation and execution in parallel mode. Using a novel MST-based algorithm for the clustering of fragments by gene association, PEACE shows significant improvement in sensitivity over the competing WCD tool and TGICL tool when applied to NGS reads (4,15), matches WCD when applied to Sanger sequencing output and shows an order of magnitude in improvement over the clustering performed in the course of assembly by the Cap3 tool (9).
As a clustering tool based on sequence distance, PEACE faces certain limitations. PEACE cannot handle duplicate genes; like WCD, it is unable to separate clusters corresponding to genes with a >88% similarity. Similarly, other natural biological effects (e.g. the trans-splicing of transcripts), effects from poorly cleaned transcript data (e.g. the failure to remove sequencing adapters or post-transcriptional poly(A)/(T) tails) and the presence of low-complexity repeats can cause similar effects in these clustering tools. These problems can be handled through the application of the assembler, while the ability to apply any assembler to small clusters results in a significant reduction in overall assembly time.
Peace can be downloaded from www.peace-tools.org, where we are committed to keep maintaining and improving the tool in the future. Meanwhile, we are developing our own MST-based assembly tool that can seamlessly integrate with PEACE. The underlying modular design of PEACE offers users many possibilities to expand and incorporate the MST algorithm for other bioinformatics applications.
Supplementary Data are available at NAR Online.
Funding for open access charge: Miami University, Departments of Computer Science and Botany.
Conflict of interest statement. None declared.
We would also like to acknowledge Iddo Friedberg, David Woods, Elizabeth Bikun, Jens Mueller and David Scoville at Miami University for their help with this project.