ESTs have been prevalent in genomic research since the first large scale EST project in 1991 [
1]. There are many EST projects that study the gene content of genome, tissue, or condition-specific transcripts (e.g. see Additional file
1: List of EST papers, section 4). In October 2005, 454 Life Sciences released the GS 20 pyrosequencer that generates over 100,000 reads per run with an average length of 110 bases [
2-
4]. In January 2007, they released the GS FLX, which generates over 200,000 reads with lengths between 200 and 300 bases. Table 1 shows the growth in the number of ESTs in GenBank in relation to their length. Many of the short sequences released after 2005 have been generated by the GS 20 or GS FLX (GenBank has no explicit field stating the sequencer type). With the release of the 454 Titanium in October 2008, which produces reads of length 400 [
4], we can expect to see the prevalence of 454 ESTs of length 400+ grow quickly.
| Table 1. ESTs added to GenBank during specified years based on length |
Besides the 454 sequencer, the following are also next-generation sequencers that generate high-throughput short reads: Illumina Genome Analyzer [
5] developed by Solexa (Cambridge, UK), Applied Biosystems SOLiD Sequencing [
6,
7], and Helicos GSS Sequencing [
8]. The 454 sequencer comes with the Newbler assembler, and there are multiple assembly packages tested for the Illumina system [
9-
15]. However, these have all been tested on small genomes, chromosomes, or BACs, which have much shallower coverage than EST contigs.
Many current EST projects have generated 454 data and either used traditional EST assembly approaches (e.g. [
16]) or aligned the ESTs to a related genome or assembled transcripts (e.g. [
17]; see Additional file
1: List of EST papers, section 4.B). Laboratories are now transitioning between the traditional Sanger ESTs and new 454 ESTs. For example, our laboratory has a full-length cDNA project using Sanger 5' and 3' ESTs, and three other projects that have a mix of 454 and Sanger ESTs. For our Sanger projects, we developed a software package called PAVE (Program for Assembling and Viewing ESTs) that utilizes mate-pair information. With the release of the 454 sequencer, we extended PAVE to work for the increased depth of the 454 EST data sets.
The ESTs generated by Sanger versus the 454 sequencer differ in number and length. The current Sanger ESTs have an average length of around 650 good bases, but the number of ESTs that are sequenced is generally low. The cDNA sample must first be cloned into a vector (typically either plasmid- or phage-based) to produce a cDNA library and then individual clones are isolated from the library and sequenced, which results in a few thousand clones being sequenced. For example, most maize libraries in GenBank have between 1000 and 10,000 ESTs. The 454 GS FLX sequencer can generate over 200,000 good ESTs per project, but at an average length of only 250 trimmed bases. The new 454 GS FLX Titanium is capable of generating over a million reads of 400 bases with reduced error [
4]. This technology currently does not produce identifiable cDNA mate-pairs.
Sanger sequencing can produce mate-pairs, where the names of the ESTs identify which ones are mates. If the clone is full length, the 5' end will start at the beginning of the transcript; otherwise, it can start anywhere within the original mRNA sequence. It is now relatively inexpensive to generate both the 3' and 5' reads of a clone, as the library only needs to be prepped once. To date, no software exists that takes full advantage of mate-pair information in order to produce better contigs. CAP3 [
18] uses mate-pair information to build contigs but does not require that they be in the same contig. Phrap [
19] uses mate-pair information to flag potential chimeric clones by inserting the chimeric mate into a contig. Clustering programs, such as STACK [
20] and PaCE [
21], will use mate-pair information to join clusters, but these may be broken into multiple contigs when assembled by CAP3 or Phrap. By contrast, PAVE requires mate-pairs to be in the same contig. It does not allow mate-pairs to be split across contigs, and if none of the ESTs in the 5' and 3' sub-contigs overlap, they are joined by n's. PAVE has been used to assemble multiple projects including 797,619 maize ESTs from GenBank [
22].
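As an illustration of the two ideas above (identifying mates by name and joining non-overlapping 5' and 3' sub-contigs with n's), the following is a minimal sketch and not PAVE's implementation. The ".f"/".r" read-name suffixes, the function names, and the gap length of 10 are all assumptions made for this example; the text does not specify the naming convention or the number of n's that PAVE inserts.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention: "<clone>.f" is the 5' read
# and "<clone>.r" is the 3' read of the same clone.
MATE_SUFFIX = re.compile(r"^(?P<clone>.+)\.(?P<end>[fr])$")


def group_mates(est_names):
    """Map each clone name to its reads, keyed by end ('f' or 'r')."""
    clones = defaultdict(dict)
    for name in est_names:
        match = MATE_SUFFIX.match(name)
        if match:  # names without a recognised suffix are skipped
            clones[match.group("clone")][match.group("end")] = name
    return dict(clones)


def join_subcontigs(five_prime, three_prime, gap=10):
    """Join two non-overlapping sub-contig consensus sequences with n's.

    The gap length is an arbitrary placeholder, not a value from the paper.
    """
    return five_prime + "n" * gap + three_prime
```

For example, `group_mates(["c1.f", "c1.r", "c2.f"])` pairs the two reads of clone `c1` and records that `c2` has only a 5' read, and `join_subcontigs("ACGT", "TTGA", gap=3)` yields a single sequence with three n's between the two halves.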
The advent of 454 sequences presents new challenges to assembly. The increased depth of the 454 data sets can cause CAP3 and other assembly programs to run out of memory. Moreover, assembling large contigs (e.g. > 1000 ESTs) is time-consuming. To address both problems, ESTs that are contained in another EST may be removed before assembly, as is done by PlantGDB [
23,
24]. PAVE removes ("buries") many of the ESTs that are contained in another ("parent" EST) during assembly; after assembly, the buried ESTs are placed in their parents' respective contigs in order to retain the expression level.
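The burying step can be sketched as follows. This is a simplified illustration under stated assumptions, not PAVE's algorithm: containment is tested here as an exact substring match on the forward strand, whereas real ESTs contain sequencing errors and would need an alignment-based criterion, and the function names and data layout are hypothetical.

```python
def bury_contained(ests):
    """Split ESTs into those kept for assembly and those buried in a parent.

    ests: dict mapping EST name -> sequence.
    Returns (kept, buried); buried maps each child name to its parent name.
    Containment is an exact substring test (a simplifying assumption).
    """
    # Longest first, so a parent is always examined before ESTs it contains.
    order = sorted(ests, key=lambda name: (-len(ests[name]), name))
    kept, buried = [], {}
    for name in order:
        parent = next((p for p in kept if ests[name] in ests[p]), None)
        if parent is None:
            kept.append(name)
        else:
            buried[name] = parent
    return kept, buried


def restore_buried(contig_members, buried):
    """After assembly, place each buried EST in its parent's contig."""
    parent_contig = {
        est: contig
        for contig, members in contig_members.items()
        for est in members
    }
    restored = {contig: list(m) for contig, m in contig_members.items()}
    for child, parent in buried.items():
        restored[parent_contig[parent]].append(child)
    return restored
```

Only the kept ESTs are passed to the assembler, which reduces memory use and run time; restoring the buried ESTs afterwards keeps the per-contig EST counts that reflect expression level.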
There are quite a few packages for the pipeline processing that EST data requires (see Additional file
1: List of EST papers, section 2). For example, EST2uni [
25] is a pipeline that trims and cleans ESTs using external programs such as Lucy [
26], assembles the ESTs with CAP3 or TGICL [
27], and has annotation capabilities. We find that ESTs generated from different technologies and laboratories require different trimming and cleaning processes, so these functions are not part of the PAVE software package. However, PAVE includes a data management system in order to allow assembling many libraries together while retaining the information about each library. The PAVE system contains a Java program called jPAVE that allows easy verification and display of assemblies in the PAVE MySQL database. The system supports annotation by UniProt [
28] match, GC content, ORFs, R statistic [
29] and comparison of contigs. The assembly software, data management software, and jPAVE viewer are freely available along with a user's guide [
30].
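Of the annotations listed above, GC content is the simplest to state precisely. The sketch below is an assumption about one reasonable definition, not a description of how PAVE computes it: it counts G and C over unambiguous bases only, so that any n's in a contig do not dilute the value.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous characters such as 'n' are excluded from the denominator;
    this choice is an assumption made for the example.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called
```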