SNAP is a high-performance ab initio gene finder
I trained and evaluated SNAP in four genomes (see Methods) and compared its performance to Genscan in all genomes, to HMMGene [12] and Genefinder [14] in C. elegans, and to Augustus [15] in D. melanogaster. Genscan performs as well as recent gene finders designed specifically for Arabidopsis, was considered one of the standards for the Drosophila GASP experiments [3], and is one of the gene finders used by the International Rice Genome Sequencing Project [17]. HMMGene and Genefinder are well-established gene prediction programs for C. elegans. Augustus is one of the latest gene prediction programs and has been shown to outperform Genscan, GENIE, and GENEID in Drosophila.
As shown in the table, SNAP is more accurate than Genscan in every genome. In C. elegans, SNAP performs better than HMMGene and almost as well as Genefinder. In D. melanogaster, SNAP is similar to Augustus. The HMMs employed by SNAP in this study have a minimal genome model without a promoter, poly-A signal, or UTRs, and the reason SNAP outperforms Genscan is simply that it is trained for each genome. When compared to gene finders tuned for a particular genome, SNAP performs about equally. It may be possible to increase the accuracy of SNAP by including more states in the HMM to model additional genomic features, or by using more sophisticated statistical techniques such as interpolated Markov models, maximum dependence decomposition trees, or isochore segmentation. Fine-tuning SNAP to particular genomes is not the subject of this study, however.
Gene prediction performance. Performance figures for SNAP are derived from 5-fold cross-validation. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa. SN and SP correspond to sensitivity and specificity.
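The SN and SP figures can be illustrated with a minimal sketch at the nucleotide level. It assumes the usual gene-finding convention, in which sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP); the function names and the representation of exons as inclusive coordinate pairs are hypothetical and are not taken from SNAP.

```python
def exons_to_positions(exons):
    """Expand inclusive (start, end) exon coordinates into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) for two sets of coding positions.

    Assumed convention: SN = TP / (TP + FN) and SP = TP / (TP + FP).
    """
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Hypothetical annotation and prediction for one sequence.
annotated = exons_to_positions([(1, 100), (201, 300)])
predicted = exons_to_positions([(51, 100), (201, 350)])
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"SN={sn:.2f} SP={sp:.2f}")
```

In this toy case the prediction misses half of the first exon and overextends the second, so both measures are penalized.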
Gene prediction in novel genomes can be highly inaccurate
A newly sequenced genome may have little training material and little experimental data to anchor gene predictions. How can one find genes in such uncharted territory? The common procedure is to use a gene finder from some other genome, perhaps the one that is most phylogenetically similar. To determine the consequences of this practice, I evaluated the difference between the intra- and inter-species performance of SNAP. The results are displayed in table 3.
Table 3. Intra- and inter-species gene prediction accuracy. Intra-species performance figures derived from 5-fold cross-validation are along the major diagonal in bold. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa.
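The layout of such an accuracy matrix can be sketched as follows: diagonal cells come from k-fold cross-validation within one genome, and off-diagonal cells come from training on one genome and testing on another. This is only an illustration of the evaluation design. The `train` and `evaluate` callables are hypothetical stand-ins (here a toy GC-content model), not SNAP's parameter estimation or scoring.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) lists for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def accuracy_matrix(gene_sets, train, evaluate, k=5):
    """matrix[source][target]: score of source-trained parameters on target genes."""
    matrix = {}
    for source, source_genes in gene_sets.items():
        matrix[source] = {}
        for target, target_genes in gene_sets.items():
            if source == target:
                # Intra-species: never test on genes used for training.
                scores = [evaluate(train(tr), te)
                          for tr, te in k_fold_splits(source_genes, k)]
                matrix[source][target] = sum(scores) / len(scores)
            else:
                # Inter-species: train on all source genes, test on all target genes.
                matrix[source][target] = evaluate(train(source_genes), target_genes)
    return matrix


# Toy stand-ins (assumptions): the "parameters" are just a mean GC fraction.
def gc_fraction(seq):
    return sum(base in "GC" for base in seq) / len(seq)


def toy_train(genes):
    return sum(gc_fraction(g) for g in genes) / len(genes)


def toy_evaluate(params, genes):
    return 1.0 - abs(params - toy_train(genes))


gene_sets = {
    "AT-rich": ["ATATGA", "ATTTAA", "AATTGA", "TTATCA", "ATAAGT"],
    "GC-rich": ["GCGCGA", "CCGGTC", "GGCCAG", "CGCGCT", "GCCGGA"],
}
matrix = accuracy_matrix(gene_sets, toy_train, toy_evaluate)
for source, row in matrix.items():
    print(source, {target: round(score, 3) for target, score in row.items()})
```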
Gene prediction accuracy with foreign parameters appears to follow GC content more than phylogenetic relationships. For example, Oryza parameters perform reasonably well in Drosophila sequence (>25% genes correct) but very poorly in A. thaliana (5% genes correct). Similarly, for finding C. elegans genes, one is better off with parameters from A. thaliana than D. melanogaster. Choosing the best foreign gene finder is therefore not simply a matter of using parameters from the closest relative.
Genomes have significant compositional differences
To look more closely at the reasons behind the inaccuracy of foreign parameters, I examined compositional differences in coding sequence, splice sites, and the translation start. The codon frequency figure displays the frequencies of degenerate codons for each of the four genomes. In general, codon preference follows GC composition. That is, the GC-rich genomes prefer G and C in the 3rd position and the AT-rich genomes prefer A or T. This helps to explain the results of the previous experiment. But even between genomes with similar GC content there are significant differences among equivalent codons. For example, Oryza prefers CTC for leucine while Drosophila prefers CTG.
Codon frequency. The frequency of each degenerate codon is indicated in a species-specific color (At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa). Codons are grouped by their parent amino acid.
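One plausible way to tabulate frequencies of this kind is to count codons in frame and normalize each codon by the total for its amino acid. The sketch below assumes the standard genetic code and in-frame coding sequences; the function name and the example sequence are hypothetical, and the normalization used for the published figure may differ.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_codon_frequencies(coding_sequences):
    """Return each observed codon's share among codons for the same amino acid."""
    counts = Counter()
    for seq in coding_sequences:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3].upper()
            if codon in CODON_TABLE:      # skip codons with ambiguous bases
                counts[codon] += 1
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


# Hypothetical coding sequence: Met, Leu (CTG), Leu (CTG), Leu (CTC), stop.
freqs = synonymous_codon_frequencies(["ATGCTGCTGCTCTAA"])
for codon in sorted(freqs):
    print(codon, CODON_TABLE[codon], round(freqs[codon], 3))
```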
Pictograms for splice sites and the translation start are shown in figure 3. The two plant genomes (Arabidopsis and Oryza) have very similar acceptor sites, but the two animal genomes (Caenorhabditis and Drosophila) each have unique features. The upstream region from -7 to -27, frequently known as the poly-pyrimidine tract, is T-rich in plants, AT-rich in Caenorhabditis, and pyrimidine-rich (proximally) in Drosophila. T is the most common nucleotide at positions -5 and -6 in all genomes, but in Caenorhabditis these are almost invariant. At -4, the plants prefer G while Caenorhabditis prefers T and Drosophila is unbiased. The splice donor sites appear to be similar among all the genomes. However, there may be species-specific higher-order contexts that contribute to some specificity, since the tuned architectures (see Methods) are slightly different for each genome. The translation start site appears to have some genome specificity. Given the compositional differences in the various signals, it is not surprising that gene prediction with foreign gene finders can be highly inaccurate.
Figure 3. Pictograms of splice sites and translation start. The height of each letter is proportional to its frequency. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa. (a) splice acceptor site – canonical ...
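The quantity a pictogram draws, the frequency of each nucleotide at each position, can be computed from a set of aligned sites. The following is a minimal sketch under the assumption that all sites have equal length and contain only A, C, G, and T; the function name and the example acceptor sites are hypothetical.

```python
from collections import Counter


def position_frequencies(aligned_sites):
    """Return one {base: frequency} dict per column of equal-length sites."""
    if not aligned_sites:
        raise ValueError("at least one site is required")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        matrix.append({base: counts[base] / len(column) for base in "ACGT"})
    return matrix


# Hypothetical acceptor sites, each ending in the canonical AG.
sites = ["TTTCAG", "TTTTAG", "ATTCAG", "TTTCAG"]
matrix = position_frequencies(sites)
for offset, column in zip(range(-len(matrix), 0), matrix):
    print(offset, column)
```

Letter heights in a pictogram would then be drawn in proportion to these per-position frequencies.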
Parameter estimation in novel genomes
Even though foreign gene finders may perform sub-optimally, their predictions may display compositional properties of the novel genome. For example, when annotating the A. thaliana genome with a C. elegans gene finder, the predictions appeared very much like real A. thaliana genes. A splice acceptor pictogram derived from these predictions is shown in the accompanying figure. Note that the sequence composition broadly resembles true A. thaliana splice acceptors, including a preference for G at -3 and a T-rich upstream sequence. It also retains some C. elegans qualities, such as a greater proportion of Ts at -5 and -6.
If predicted genes have compositional properties similar to real genes, it should be possible to train a gene finder for a novel genome in the absence of any data. One simply runs a foreign gene finder (or more than one) to create a virtual training set and estimates parameters for the novel genome from the virtual data. So rather than use foreign gene finders to identify genes, one uses them to bootstrap parameter estimation. To determine if this procedure works, I assumed one of the genomes was novel and evaluated the gene prediction accuracy of bootstrapped parameters derived from one, two, or three foreign gene finders. The results are displayed in table 4.
Table 4. Performance of foreign and bootstrapped parameters. The foreign parameter data (top part of the table) are similar to table 3 but use the default rather than tuned architectures (see Methods). The bold face values are determined by 5-fold ...
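The bootstrapping procedure (run one or more foreign gene finders, pool their predictions into a virtual training set, then estimate parameters from that set) can be outlined as below. This is a sketch under stated assumptions: the gene finders and the estimator are toy GC-content stand-ins, predictions from several finders are simply pooled without removing duplicates, and none of the names correspond to SNAP's actual interface.

```python
def bootstrap_parameters(genome, foreign_finders, estimate):
    """Estimate parameters for a novel genome from foreign gene predictions."""
    virtual_training_set = []
    for predict in foreign_finders:
        # Predictions are used as training data, not as the final annotation.
        virtual_training_set.extend(predict(genome))
    return estimate(virtual_training_set)


# Toy stand-ins (assumptions, for illustration only).
def gc_fraction(seq):
    return sum(base in "GC" for base in seq) / len(seq)


def make_toy_finder(min_gc, window=6):
    """A 'gene finder' that reports fixed windows with GC fraction >= min_gc."""
    def predict(genome):
        segments = [genome[i:i + window]
                    for i in range(0, len(genome) - window + 1, window)]
        return [s for s in segments if gc_fraction(s) >= min_gc]
    return predict


def estimate_gc(segments):
    """Toy parameter estimation: mean GC fraction of the training segments."""
    if not segments:
        raise ValueError("no predictions to train on")
    return sum(gc_fraction(s) for s in segments) / len(segments)


novel_genome = "ATATATGCGCGCATTTAAGGCCGGATATTA"
finders = [make_toy_finder(0.5), make_toy_finder(0.8)]
params = bootstrap_parameters(novel_genome, finders, estimate_gc)
print(params)
```

Passing a single finder corresponds to bootstrapping from one foreign genome; passing several corresponds to mixing gene finders.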
Bootstrapped parameters work very well in many cases. In A. thaliana, the best foreign parameters come from C. elegans. But if the gene predictions are used for training rather than the final annotation, the prediction accuracy rises from 86.3/91.0 (nucleotide sensitivity/nucleotide specificity) to 96.6/93.2. The worst foreign performance (D. melanogaster parameters) changes from an abysmal 26.0/96.0 to a respectable 75.2/95.5. Bootstrapped parameters also appear to work very well in C. elegans and D. melanogaster. In C. elegans, the highest performing sets, 96.7/91.1 and 95.8/91.9, come from mixing two or three gene finders. In Drosophila, the best parameters are derived from O. sativa predictions. In general, bootstrapped parameters in these genomes can rival actual data, and even the worst bootstrapped parameters are reasonably accurate.
In O. sativa, bootstrapped parameters are only somewhat helpful. The Arabidopsis and Caenorhabditis parameters improve in both sensitivity and specificity, but the Drosophila foreign parameters are actually better than any of the bootstrapped ones. The reason why O. sativa behaves differently from the others is unclear at this time. In general, however, estimating parameters from gene predictions appears to be a simple and convenient way to train a gene finder for a novel genome. It is important to note that the genomes studied in this paper are all relatively compact. The techniques here may not work well in mammalian genomes or other genomes where exons account for a small fraction of the total sequence. I am currently investigating methods to improve gene finding accuracy and to quickly estimate gene prediction parameters in large genomes.