EGassembler consists of a pipeline of five components, each using highly reliable open-source tools (4), together with non-redundant, custom-made databases of repeats (EGrep) and vectors (EGvec) that cover almost all publicly available vector and repeat databases. EGrep is a non-redundant repeat database covering the latest releases of RepBase (10), TREP (11) and the TIGR plant repeats (12), as well as thousands of other publicly available repeat sequences on the Internet. EGrep was constructed by combining repetitive elements and assembling them into a single database using the PHRAP and CAP3 assembly programs. EGvec was made by assembling NCBI's UniVec, EMBL's emvec vector/adaptor library and other vector sequences using the CAP3 program.
Figure 1 shows a flow chart of the EGassembler process. The web server accepts any type of DNA sequence in FASTA format (EST, GSS, cDNA, gDNA). The sequence cleaning process involves basic procedures such as removing the polyA/polyT tail, clipping low-quality ends (ends rich in undetermined bases) and discarding sequences that are too short (shorter than 100 bases) or that appear to be mostly low-complexity sequence. The repeat masking process compares the query sequences against one or more files of FASTA sequences (the masking library). Vectors and organelle sequences are masked using the program Cross_Match (9), a general-purpose utility for comparing any two sets of DNA sequences. It compares the query sequences to a set of vector or organelle sequences and produces vector/organelle-masked versions of the input sequences. The sequence assembly process uses the CAP3 program (7) to cluster and assemble the sequences into contigs and singletons. CAP3 assembles ESTs from the same gene under more stringent criteria than other approaches and is able to distinguish gene family members while tolerating sequencing errors.
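The cleaning rules described above can be illustrated with a minimal Python sketch. It is not the cleaning tool used by EGassembler. Only the minimum length of 100 comes from the text; the window size, the allowed number of undetermined bases, the minimum tail length, the single-base fraction used as a low-complexity test, the order of the steps and all function names are assumptions chosen for illustration.

```python
import re

MIN_LENGTH = 100   # from the text: sequences shorter than 100 are discarded
WINDOW = 20        # assumption: size of the terminal window inspected for N's
MAX_N = 2          # assumption: N's tolerated inside a terminal window
MIN_TAIL = 10      # assumption: shortest A/T run treated as a polyA/polyT tail


def clip_n_rich_ends(seq, window=WINDOW, max_n=MAX_N):
    """Trim each end while its terminal window holds more than max_n N's."""
    start, end = 0, len(seq)
    while end - start >= window and seq[start:start + window].count("N") > max_n:
        start += 1
    while end - start >= window and seq[end - window:end].count("N") > max_n:
        end -= 1
    # Drop any undetermined bases still sitting at the very ends.
    return seq[start:end].strip("N")


def trim_poly_tail(seq, min_tail=MIN_TAIL):
    """Remove a trailing polyA run and a leading polyT run."""
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)
    return seq


def is_low_complexity(seq, max_fraction=0.8):
    """Crude stand-in test: one symbol makes up most of the sequence."""
    if not seq:
        return True
    return max(seq.count(base) for base in "ACGTN") / len(seq) > max_fraction


def clean_sequence(seq):
    """Return the cleaned sequence, or None if it should be discarded."""
    seq = clip_n_rich_ends(seq.upper())
    seq = trim_poly_tail(seq)
    if len(seq) < MIN_LENGTH or is_low_complexity(seq):
        return None
    return seq
```

In this sketch a read that is entirely polyA, or one that falls below 100 bases after trimming, is returned as `None` and would be dropped before masking.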
Figure 1 EGassembler data flow. The flowchart shows the pipeline used in the EGassembler web server. The middle portion shows the processes and running modes (parallel or single). The right side shows the action of each process and the left side shows the databases used.
All of the processes in the pipeline, except the assembly step, run in parallel using all CPU resources available on the server. Programs that were originally written as serial programs, using only one CPU, are now executed in parallel through a new algorithm implemented with the Perl thread module. This implementation is especially valuable for vector trimming and organelle masking: with the original program on a single CPU these steps required several days, depending on the input sequences, whereas they now take only a few hours. Figure 2 shows a diagram of EGassembler performance under different loads.
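The general idea of running a serial per-sequence step in parallel can be sketched as follows. This is a hedged illustration, not the EGassembler implementation: the server uses the Perl thread module, whereas this sketch uses Python's `concurrent.futures.ThreadPoolExecutor`, and the chunking scheme, the function names and the stand-in `uppercase_chunk` step are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(records)))
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def run_in_parallel(records, process_chunk, n_workers=4):
    """Apply process_chunk to each chunk concurrently and merge in input order."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the order of the submitted chunks.
        results = pool.map(process_chunk, chunks)
        return [item for chunk_result in results for item in chunk_result]


def uppercase_chunk(chunk):
    """Hypothetical stand-in for a serial step such as vector masking."""
    return [(name, seq.upper()) for name, seq in chunk]
```

Python threads give a real speed-up mainly when each chunk is handed to an external program or is otherwise I/O-bound; for pure-Python CPU-bound work a process pool would be the closer analogue.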
Figure 2 EGassembler performance. The large plot shows the EGassembler performance under different sequence loads and different numbers of CPUs. The inset displays the performance with ≤8000 sequences.
The main menu of the EGassembler interface has three sub-menus that provide users with the following processing options.
In the first option, all the components in the pipeline are run consecutively with their default options. After uploading the sequences and choosing the libraries for trimming and masking, users can obtain the assembly results in one click. The results of all steps are made available for download through URLs, both as a single zipped file and as separate files for each step. The URLs remain valid for one week after completion.
In the second option, users run the components of the pipeline interactively and have the opportunity to run each of them with advanced options. The output of each step is automatically used as the input to the next step of the pipeline; users can also jump to any step at any time using the previous results.
In the third option, users can run any one of the components on its own with all of its options available. The web interface displays the default parameters of the original programs, any of which users can change.
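The chaining behaviour described for these options, where each step's output feeds the next and a run can be resumed at any step from earlier results, can be sketched in Python. This is only an illustration under stated assumptions: `run_pipeline`, its parameters and the way results are stored are hypothetical and do not describe the server's code.

```python
def run_pipeline(data, steps, start_at=None, results=None):
    """Run steps in order, feeding each step's output into the next one.

    steps    -- list of (name, function) pairs in pipeline order
    start_at -- optional step name to resume from; the output of the
                preceding step must then be present in `results`
    results  -- outputs of earlier runs, keyed by step name
    """
    results = dict(results or {})
    names = [name for name, _ in steps]
    first = names.index(start_at) if start_at else 0
    if first > 0:
        # Resume from the stored output of the step before start_at.
        data = results[names[first - 1]]
    for name, func in steps[first:]:
        data = func(data)
        results[name] = data  # keep every intermediate result
    return results
```

Keeping every intermediate result is what allows a single step to be rerun with different options without repeating the steps before it.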