In this paper we have evaluated how effective purely computational approaches for genome annotation can be, even in the absence of a large collection of previously known genes, by means of the largest attempt so far to experimentally compare several gene finders. After testing the accuracy of Ensembl, SGP2 and TWINSCAN on the chicken genome, we have shown that de novo comparative methods followed by experimental verification remain a successful approach to the annotation of newly sequenced genomes about which little is known.
We found that approximately 50% of predictions that were in TWINSCAN and SGP2 but not in Ensembl could be experimentally verified (Figure ). These experiments demonstrate that de novo comparative prediction methods are effective at complementing homology-based methods and confirm that a combination of methods can improve the prediction accuracy [18-22]. Moreover, in spite of the limited gene sequence data available for chicken, the combination of TWINSCAN and SGP2 achieves better accuracy than previous attempts to verify by RT-PCR computational predictions that fall outside a set of annotations [17, 23]. On the other hand, looking at the intron assemblies unique to one prediction set, the proportion of positives is largely reduced for predictions not in Ensembl. The predictions unique to one of the de novo methods show an abundance of gene models with 2 and 3 exons, which may be artefacts due to genome misassemblies. These results are in contrast with the high success rate (77%) of the predictions unique to Ensembl. This is a reasonable observation considering that the Ensembl prediction pipeline has access to genes that do not follow a 'standard' gene-grammar (e.g., unusual codon usage), but which may nevertheless be represented in the cDNA/protein databases used.
The Ensembl chicken gene set has been found to have a 96% positive rate, whereas the IAs from the two-way intersections that include Ensembl, '(E and S) not T' and '(T and E) not S', and the Ensembl orphans have lower positive rates of 81%, 65% and 77%, respectively. This stems from the fact that most exons predicted by Ensembl are also predicted by both SGP2 and TWINSCAN. Additionally, de novo comparative methods are useful for extending partial predictions from homology-based methods. Ensembl may generate predictions based on protein fragments or on partial homology from other species, and TWINSCAN and SGP2 predictions can add bona fide exons to the Ensembl predictions they overlap with. For the 5' end we show that 40% of the tested cases, where either TWINSCAN or SGP2 predicted at least one additional exon, were verified (Table ). To our knowledge, this is the first time that experimental evidence has been provided for extensions to homology-based models produced by de novo methods.
We observed that the subsets containing SGP2 IAs (e.g., '(S and E) not T') have in general a higher proportion of RT-PCR positives than those containing TWINSCAN IAs (e.g., '(T and E) not S') (Figure , Table ). There are two factors that may contribute to this difference. The first is an intrinsic difference between TWINSCAN and SGP2: SGP2 uses TBLASTX (translated) alignments between human and chicken to reward exons overlapping aligned regions, whereas TWINSCAN uses BLASTN (nucleotide) alignments to influence the scores of exons, splice sites, and translation initiation and termination sites. Human and chicken are sufficiently diverged that translated alignments may be more sensitive, whereas nucleotide alignments fail to cover many known exons. The other factor is incidental to the way TWINSCAN was trained and run to produce the predictions tested. TWINSCAN used 525 chicken RefSeqs to estimate parameters for its probability model. This training set was probably too small to produce optimal parameter values. SGP2, on the other hand, was run with a combination of parameters estimated from the much larger set of known human genes (for its model of chicken DNA sequence) and parameters hand-tuned using the same 525 chicken genes (for its scoring of human-chicken alignments). Although a larger fraction of SGP2 predictions yielded positive experimental results, we found that TWINSCAN tends to be more accurate than SGP2 in the prediction of the intron boundaries (Figure , Table ). This difference stems from the intron model used by TWINSCAN, as opposed to SGP2, which does not model introns explicitly. TWINSCAN was re-run after completion of the experiments with an improved intron-length model, yielding a prediction set that was substantially smaller and more accurate (see Table S6) than the set tested.
In spite of these differences, comparing the gene predictions with a set of coding cDNAs released after the completion of these analyses, we found that all three methods have similar sensitivity (79%) (see Methods for details). Hence, the de novo comparative methods cover a fraction of the transcriptome similar to that covered by homology-based methods, with a minimal initial amount of genome-specific expression data.
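The subset labels used above, such as '(E and S) not T', correspond to regions of a three-way Venn diagram. The following minimal Python sketch illustrates that notation with plain set operations. It assumes, purely for illustration, that each intron assembly (IA) can be represented by an identifier that is identical across prediction sets; the function name `venn_regions` and the identifiers `ia1` to `ia7` are hypothetical and do not come from the analysis pipeline, which may match IAs differently.

```python
def venn_regions(ensembl, sgp2, twinscan):
    """Partition IA identifiers into the seven regions of a three-way Venn diagram.

    Assumption: an IA shared by two predictors carries the same identifier
    in both sets (a simplification made for this sketch).
    """
    e, s, t = set(ensembl), set(sgp2), set(twinscan)
    return {
        "E and S and T": e & s & t,      # triple intersection
        "(E and S) not T": (e & s) - t,  # two-way intersections
        "(T and E) not S": (t & e) - s,
        "(S and T) not E": (s & t) - e,
        "E only": e - s - t,             # predictions unique to one method
        "S only": s - e - t,
        "T only": t - e - s,
    }


# Hypothetical identifiers, for illustration only.
ensembl = {"ia1", "ia2", "ia3", "ia4"}
sgp2 = {"ia1", "ia2", "ia5", "ia6"}
twinscan = {"ia1", "ia3", "ia5", "ia7"}

regions = venn_regions(ensembl, sgp2, twinscan)
for label, members in regions.items():
    print(f"{label}: {sorted(members)}")
```

Each identifier falls into exactly one region, so the region sizes add up to the size of the union of the three prediction sets.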
The experimentally verified IAs represent a fraction of the actual number of chicken genes that can be eventually found using our methods. If we extrapolate the proportions of experimentally verified IAs (Figure ) to all the generated IAs in the Venn diagram (Figure ) and use an average distribution of coding exons per gene from other vertebrates (human, mouse and rat), we estimate a range of 14,600 to 17,500 experimentally verifiable chicken genes from our computational predictions. In this paper we only analysed intron assemblies and deliberately left out a number of chicken protein-coding intronless predictions (3,049 from SGP2, 2,727 from TWINSCAN and 1,855 from Ensembl). The triple intersection of these intronless genes contains 148 genes, which are worth investigating and for which techniques different from the ones applied here will be required.
Considering all 2-way IAs (see Table ), one would need 11,274 RT-PCR reactions to experimentally confirm about 7,232 (64%) genes. This number of experiments compares favourably to large-scale EST projects with the added benefit of having almost no redundancy (only gene fissions and misassemblies will contribute to redundancy). The biggest drawback to EST sequencing is its large redundancy and extensive overlap. The falling cost of primers and the increased flexibility of large-scale molecular biology centers make this approach of computational prediction followed by experimental verification cost effective and scalable [5, 6]. As RT-PCR primers can be designed with appropriate linker sites, such an approach could also provide a physical resource of clonable fragments. We conclude that de novo comparative gene prediction followed by experimental verification is an effective way to carry out the annotation of a newly sequenced genome for which little gene sequence information is known. In particular, as our results show, performing RT-PCR and sequencing for all the predicted novel genes, starting with those predicted by multiple de novo methods, should enhance the quality of the annotation in forthcoming eukaryote genome sequencing projects.
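The yield quoted above for the 2-way IAs can be checked with simple arithmetic. The short Python sketch below uses only the two figures given in the text (11,274 RT-PCR reactions and about 7,232 confirmed genes); the function name `verification_yield` is a hypothetical helper introduced here for illustration.

```python
def verification_yield(confirmed, reactions):
    """Return the fraction of RT-PCR reactions that confirm a gene."""
    if reactions <= 0:
        raise ValueError("reactions must be positive")
    return confirmed / reactions


# Figures quoted in the text for all 2-way IAs.
reactions = 11_274
confirmed = 7_232

rate = verification_yield(confirmed, reactions)
reactions_per_gene = reactions / confirmed

print(f"Yield: {rate:.0%}")  # prints "Yield: 64%"
print(f"Reactions per confirmed gene: {reactions_per_gene:.2f}")
```

Under these figures, roughly three reactions are needed for every two confirmed genes.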