Despite considerable efforts in the bioinformatics community, the performance of existing gene prediction tools is still not satisfactory. In current genome projects, a common approach to gene finding is as follows. Several sets of gene predictions are compiled, usually from different gene finders trained specifically for the species at hand. Further, alignments of ESTs and proteins to the genome are constructed. Finally, the predictions and the alignments are combined to find plausible gene structures, either manually or by using meta tools that combine several predictions and alignments [e.g. (1)].
AUGUSTUS is a gene finder based on a Generalized Hidden Markov Model (GHMM) (2). The original version of the program was a purely ab initio method, i.e. its prediction was based solely on information contained in the genomic sequence to be analyzed. An extended version of the program can use additional extrinsic information, for example matches to protein databases or alignments of genomic sequences, to improve the prediction accuracy (4). At the recent EGASP workshop in Cambridge, UK, existing gene finders for the human genome were systematically evaluated on a large set of well-annotated regions of the human genome (5). At this workshop, AUGUSTUS turned out to be the best program in the category of ab initio gene prediction. Its performance could be further improved by using BLAST (6) hits to EST or protein sequences and alignments of syntenic genomic sequences produced with DIALIGN (7); in this category, however, the program was outperformed by N-Scan, a new program based on multiple alignments of genomic sequences. Compared to more traditional approaches, gene-finding methods based on genomic sequence alignments have a considerable advantage since they do not depend on EST or protein sequences or statistical models of gene structures (10). On the other hand, alignment-based methods work only if genome sequences at an appropriate evolutionary distance are available. Although the performance of ab initio gene-prediction methods is usually improved if information from comparative sequence analysis is added, ab initio gene prediction remains highly important: for many newly sequenced genomes, few EST or related genomic sequences are available, and comparison to protein sequences can find only those genes that have close relatives in existing databases.
To make AUGUSTUS available to the research community, we set up a WWW server at the Göttingen Bioinformatics Compute Server (GOBICS) (14). Like most gene-prediction methods that are currently available, earlier versions of AUGUSTUS predicted exactly one transcript per gene and ignored the fact that one gene often yields more than one distinct mRNA product. It has been estimated that 40–60% of all human genes have alternative splice forms. For these genes, 70–88% of the alternative splice forms change the protein product; the remaining splice variants differ only in the untranslated regions (16). Thus, it is important to have gene-finding tools that are able to deal with this phenomenon. The program SLAM (17), for example, predicts alternative splice variants. This program, however, is based on alignments of genomic sequences, and it requires two syntenic genomic sequences as input data. We recently installed a new version of AUGUSTUS on our server that can predict multiple transcripts per predicted gene. To our knowledge, this is the first ab initio gene finder that can predict multiple transcripts, and our web server is the only gene prediction web server with this option.
With our new alternative-transcripts option, the user can control the number of predicted splice variants per gene. This way, it is possible to influence the sensitivity and specificity of the program output. If predicted genes or transcripts are automatically evaluated and post-processed, high prediction sensitivity may be desirable to increase the number of candidate genes to be analyzed, even if this increases the number of false-positive predictions. In contrast, if expensive experiments are carried out based on computationally predicted genes, it is preferable to have highly specific tools that minimize the risk of false-positive predictions. Thus, a good gene-finding method should allow the user to choose between high sensitivity and high specificity. At our server, this can be done by specifying the maximum number of predicted splice variants. In addition, we implemented a motif-searching option at our server where predicted genes can be searched for user-specified regular expressions, e.g. PROSITE patterns.
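To illustrate what searching predicted proteins for a PROSITE pattern involves, the following minimal Python sketch translates a pattern in PROSITE syntax into a regular expression and reports all hits in an amino acid sequence. It is not the server's implementation: the function names `prosite_to_regex` and `find_motifs` and the example sequence are hypothetical, and only the common syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors at the ends of a pattern).

```python
import re

# One PROSITE element: a residue, a set [..], an exclusion {..} or the
# wildcard x, optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression string.

    Simplifying assumption: anchors inside brackets (e.g. [G>]) are not
    supported.
    """
    pattern = pattern.strip().rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                    # any amino acid
            core = "."
        elif core.startswith("{"):         # any amino acid except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(protein: str, pattern: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched substring) for every hit.

    A lookahead is used so that overlapping hits are reported as well.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]


if __name__ == "__main__":
    # N-glycosylation site pattern applied to a made-up peptide.
    for start, hit in find_motifs("MKNGSAAANPTQ", "N-{P}-[ST]-{P}."):
        print(start, hit)
```

In this sketch the pattern is applied to the translated protein sequence of each predicted transcript, one call to `find_motifs` per sequence; a plain regular expression could be passed to `re.finditer` directly without the translation step.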