Ten years ago, it was proposed that databases of randomly selected
cDNA sequences would have applications in the discovery of new human
genes, mapping of the human genome and identification of coding
regions in genomic sequences (1). The high throughput, made possible
by minimal characterisation of each clone, has allowed the
expressed sequence tag (EST) databases to grow rapidly.
While many uses for ESTs have indeed been found, there are two features
of mammalian genomes for which the establishment of these databases
has proved particularly helpful: (i) the numerous large families
of homologous genes, many of which are functionally redundant (2, 3),
and (ii) the high degree of RNA splicing, with genes split
into tens or even hundreds of exons, often assembled into several alternatively
spliced mRNA variants (4). Alternative
splicing may greatly increase the complexity of functional gene
products encoded by a genome (for a recent review, see 5).
To a greater or lesser degree, most
eukaryotic genomes possess these characteristics. Thus, ESTs can
help to provide a more dynamic view of DNA coding content than arises
from sequencing just a single cDNA product and the gene itself,
as was most often done in the past. Seen in this way, ESTs are a valuable
counterpart to large-scale genome sequencing projects. Enormous
effort has gone into the public human and mouse EST projects, with >4.4
million ESTs currently available for these two species. The human and mouse genome
sequencing projects benefit directly in terms of genes revealed,
exons predicted and differentially spliced products mapped. Indeed,
because of the limitations of current de novo gene
prediction algorithms (6, 7), EST matches are intrinsic to the gene
prediction protocol in the ‘real time’ human genome
annotation project ENSEMBL (http://www.ensembl.org).
In this light, it seems regrettable that the two completed animal
genomes, those of Caenorhabditis elegans and Drosophila melanogaster
(8, 9), have not been accompanied by similar large-scale
EST projects. The 80 000 Drosophila ESTs are
only informative for highly expressed genes (10).
Despite the utility of ESTs for gene expression analysis, the many
researchers who are primarily dependent on publicly available servers
may have found it difficult to use EST databases in a sensitive
and efficient manner. They are likely to be faced with a variety
of problems, amongst which are the following. (i) Providers of popular
online BLAST servers need to set restrictions on the query size
or CPU time per job to prevent resource hogging: many vertebrate
genes are too large to be submitted in a single query. (ii) Eukaryotic
genomes are full of dispersed repetitive sequences such as the Alu
and LINE-1 retroelements in the human genome (11–13): many of these
are transcribed by RNA polymerase III and/or are present in RNA polymerase
II-transcribed 3′ non-coding exons, so matches to them
can dominate the top scores in the BLAST output, hindering detection
of the true spliced exons. (iii) ESTs from closely related genes
(e.g. recently duplicated paralogous genes) may be present in the
output and could outnumber the true ESTs. (iv) Analysing the results
from BLAST output alone is time consuming and inefficient, while
other commonly used tools may be inappropriate: for example, the
Clustal W multiple sequence alignment program was not designed for
aligning (and cannot align) exons from multiple ESTs onto genomic sequence
(14). Although this has often
been attempted, a multiple alignment calculation should be redundant,
since all the information needed to align the ESTs on the genome sequence
is intrinsic to the BLAST output.
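
The last point can be illustrated with a minimal sketch, which is not the method of any server discussed here. It assumes the hits are available in a 12-column tab-separated form (query, subject, identity, length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score), with the genomic sequence as query and each EST as subject; the function names `place_ests` and `introns` and the example records are hypothetical. Under these assumptions, the coordinates of each high-scoring segment pair are enough to place every EST on the genomic sequence without any multiple alignment step.

```python
from collections import defaultdict


def place_ests(tabular_text):
    """Group BLAST hits by EST and place each one on the genomic query.

    Assumes 12 tab-separated columns per line, with the genomic sequence
    as query (columns 7-8) and the EST as subject (columns 9-10).
    Returns {est_id: [(genome_start, genome_end, est_start, est_end, strand)]}
    with the segments of each EST sorted by genomic start.
    """
    placements = defaultdict(list)
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        est_id = fields[1]
        q_start, q_end = int(fields[6]), int(fields[7])
        s_start, s_end = int(fields[8]), int(fields[9])
        # Reversed subject coordinates indicate a minus-strand match.
        strand = "+" if s_start <= s_end else "-"
        placements[est_id].append(
            (min(q_start, q_end), max(q_start, q_end),
             min(s_start, s_end), max(s_start, s_end), strand)
        )
    return {est: sorted(segs) for est, segs in placements.items()}


def introns(segments):
    """Return genomic gaps between consecutive segments (candidate introns)."""
    return [
        (left[1] + 1, right[0] - 1)
        for left, right in zip(segments, segments[1:])
        if right[0] > left[1] + 1
    ]


# Invented example records: one EST spanning two exons, one single-exon EST.
EXAMPLE = "\n".join([
    "gene1\tEST_A\t99.0\t100\t1\t0\t101\t200\t1\t100\t1e-50\t180",
    "gene1\tEST_A\t98.0\t150\t3\t0\t501\t650\t101\t250\t1e-70\t260",
    "gene1\tEST_B\t100.0\t80\t0\t0\t521\t600\t80\t1\t1e-40\t150",
])

if __name__ == "__main__":
    for est, segs in place_ests(EXAMPLE).items():
        print(est, segs, "candidate introns:", introns(segs))
```

In this invented example, the two segments of EST_A are separated on the genome by a gap that would be interpreted as a candidate intron, while EST_B matches on the opposite strand within the second segment.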
The specific requirements for searching EST databases impose the
need to provide dedicated servers. The BLAST2-based EbEST server
(http://rgd.mcw.edu/EBEST/)
allows large queries, clusters the matching ESTs and provides graphical
output summarising the matched exons (15).
EbEST can integrate results from other gene prediction methods,
providing a useful pointer to potential gene structure. However,
the EbEST server does not provide the user with the raw findings
that might be needed to judge the quality of the interpretation
or reveal alternative processing. The GeneSeqer dynamic programming software
is available as a server for plant ESTs only
(http://gremlin1.zool.iastate.edu/cgi-bin/gs.cgi)
or may be downloaded for local use (16).
The query size accepted by the server is limited, probably reflecting the computational
load imposed by the slow (but sensitive) dynamic programming algorithm,
which incorporates a hidden Markov model for splice site detection.
For databases of limited size, GeneSeqer is likely to outperform
BLAST2 servers in the detection of divergent sequences as well as
very short exons.
Although BLAST2 is probably not quite as sensitive as dynamic
programming algorithms, it is much faster. For the rapid detection
of exons in ESTs that are expressed from a given spliced gene, it
would be difficult to improve substantially on the BLAST2 algorithm
(17). It can detect multiple
exons and tolerate sequencing errors within an exon. The one obvious drawback
is that it cannot detect very short matches and, therefore, the
occasional very small exon (below ~20 bases long) will be overlooked.
Based on feedback from experimentalists wishing to access ESTs,
we felt that there was a need for an online resource for querying
EST databases that could not only provide researchers with a graphical
output summarising exon/intron structure, including any
splice variants, but would also provide the raw BLAST results so
that artefactual matches might be understood and eliminated. The
main computational tools needed to build this server already exist;
for example, RepeatMasker has become the standard tool for masking
dispersed repeats in genomic sequence (18),
while ‘gapped’ BLAST (17) is
well suited to homology searching with ESTs. Therefore, to meet
this need, we focussed on (i) the integration of these tools to
give optimal search results, and (ii) parsing the BLAST output and
designing appropriate presentation for the EST matches, which has
been achieved by providing the user with new alignment and graphical
display outputs, the latter using the Artemis sequence display program
(19). This manuscript outlines
the development and application of the Gene2EST server which, it
is hoped, will prove to be of some use to the research community.
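
As an illustration of the masking step that precedes the search, the sketch below replaces repeat intervals in a genomic query with `N` so that they cannot seed EST matches. It is a simplified stand-in and does not reproduce RepeatMasker or the Gene2EST implementation: the function name `mask_intervals`, the 1-based inclusive interval convention and the example sequence are all assumptions made for this example.

```python
def mask_intervals(sequence, repeats, mask_char="N"):
    """Return `sequence` with each repeat interval replaced by `mask_char`.

    `repeats` is an iterable of (start, end) pairs, assumed here to be
    1-based and inclusive; intervals are clipped to the sequence length.
    """
    chars = list(sequence)
    for start, end in repeats:
        start = max(start, 1)
        end = min(end, len(chars))
        for i in range(start - 1, end):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical 10-base query with one repeat annotated at positions 3-5.
    print(mask_intervals("ACGTACGTAC", [(3, 5)]))
```

Masking with `N` keeps the coordinates of the query unchanged, so positions reported for matches in the masked sequence still refer to the original genomic sequence.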