PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of wtpaEurope PMCEurope PMC Funders GroupSubmit a Manuscript
 
Bioinformatics. Author manuscript; available in PMC 2010 June 3.
Published in final edited form as:
PMCID: PMC2880447
EMSID: UKMS27365

Genomix

a method for combining gene-finders’ predictions, which uses evolutionary conservation of sequence and intron–exon structure

Abstract

Motivation

Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.

Results

We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron–exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of ~1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.

1 INTRODUCTION

The number of genomes being sequenced is increasing, making automatic prediction of genes all the more important (Brent, 2005). At present, the most accurate gene-finders are those that use transcript data to predict genes (Guigó et al., 2006). However, many genomes being sequenced lack extensive, or in some cases any, transcript data. For example, whole-genome sequencing projects for at least a dozen nematode species are under way (Liolios et al., 2006), and none of these will have nearly as much transcript data as the model organism Caenorhabditis elegans.

A straightforward way to improve gene-finding accuracy is to combine the results of several gene-finders, to take advantage of the fact that different gene-finders are good at predicting different types of genes. In this way, gene-finders that make uncorrelated errors can correct each other (Dietterich, 1997). Burset and Guigó (1996) realized this when they ran nine different gene-finders, and noticed that although the predictions were poorly correlated at the nucleotide level, only 1% of exons were missed by all the programs. A combined gene set is most likely to improve accuracy if the input gene-finders predict different sets of genes correctly and incorrectly (Ali and Pazzani, 1996).

Several different combiners have been inspired by Burset and Guigó’s observations (Allen et al., 2006; Howe et al., 2002; Murakami and Takagi, 1998; Pavlović et al., 2002; Schiex et al., 2001; Shah et al., 2003; Yada et al., 2003; Zhang et al., 2003; Supplementary Table 1). The combiner software jigsaw (Allen et al., 2006) performed as well or better than any of the other entries in the egasp competition (Guigó et al., 2006). In that competition, jigsaw’s predictions for the human encode regions had 81% sensitivity and 89% specificity at the exon level.

Combiners for gene-finders generally have two elements: a method of scoring predicted features such as exons or splice sites, and a method of combining the highest-scoring features into a gene structure. Most combiners score predicted exons using the confidence scores provided by the input gene-finders for the predicted exons. However, such confidence scores are often poorly correlated with exon-level accuracy (Rogic et al., 2001).

Several recently published combiners also use information on the conservation level of predicted features. For example, Zhang et al. (2003)’s combiner takes gene predictions, and predicted conserved splice sites, start and stop codons, and bases as input. Similarly, jigsaw can use predicted conserved regions as an input (Allen et al., 2006). The advantage of using predicted conserved features as input to a combiner is that sequence conservation is a good indicator of whether a predicted feature is likely to be real (Parra et al., 2003; Ureta-Vidal et al., 2003).

Here we present a new method for combining the results of several gene-finders, called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron–exon structure, and combines them into a gene structure. Genomix differs from previous combiners that use conservation information in that it uses conservation as its primary score, and uses conservation of exon boundaries as well as of exon sequence.

Here we describe how Genomix was used to combine exons predicted in the C.elegans genome by four different gene-finders. We measured the accuracy of Genomix’s predictions on a test set of ~1500 confirmed C.elegans genes, and found that Genomix increases the exon-level specificity by 10.1% and sensitivity by 2.7% over the best input gene-finder. Compared to jigsaw, a state-of-the-art combiner (Allen et al., 2006), Genomix has higher specificity by 3.5% (Fisher’s test: P < 10−9) but slightly lower sensitivity (by 0.7%; Fisher’s test: P = 0.1) when the same core set of input predictions are used.

2 METHODS

Exon and gene predictions that have sequence similarity matches in related species are more likely to overlap real genes than those that lack similarity matches (Parra et al., 2003). We surmised that predicted exons that have intron–exon boundaries that are conserved with homologous exons are more likely to be correct than predicted exons that have non-conserved intron–exon boundaries. Thus, we have developed a method for combining the outputs of several gene-finders, by assuming that the most plausible predicted exons are those that are best conserved within or between species in terms of their sequence and intron–exon boundaries.

2.1 Input gene sets

We refer to the species for which we want to make gene predictions as the ‘query species’. To predict genes in a piece of genomic DNA from the query species (e.g. from C.elegans) or for its whole genome, our program requires gene predictions from multiple gene-finders for the DNA sequence. It takes GFF (Gene Feature Format; R. Durbin and D. Haussler; http://www.sanger.ac.uk/Software/GFF) files of gene predictions as input. It also requires predictions from multiple gene-finders for one or more homologous genomic regions, either from the same species and/or from one or more related species (e.g. C.briggsae and C.remanei). In the absence of genomic sequence from related species, it is possible to run Genomix by using input predictions from the query species alone. (In this case, Genomix selects the predicted exons that are best conserved between paralogous genes in the query species.)

2.2 Overall strategy of Genomix

The first step in our method is to divide the input DNA sequences from the query species and related species (if any) into regions that are likely to contain just one or a few genes. Then, for each genomic region defined in the query species:

  1. we identify its top homologous region, which can be either an orthologous region in a related species, or a paralogous region in the query species;
  2. for each predicted exon in the genomic region, we calculate a score that reflects its conservation relative to the exons in the top homologous region;
  3. we use dynamic programming to select the best conserved (top-scoring) predicted exons in the query region, and combine them into a gene structure.

We explain each of these steps in more detail below.

2.3 Exon clusters

Different gene-finders often split one gene into several predictions (for example, the snap predictions for ver-1 in Fig. 1), or merge several genes into one prediction. To divide a piece of input DNA into regions that are likely to contain one or just a few genes each, we followed the approach used by Murakami and Takagi (1998) to identify ‘clusters’ of predicted exons along the input DNA. That is, two or more exons on the same strand are put in the same ‘exon cluster’ if they overlap, or if more than one gene prediction program placed them together in a gene prediction. For example, an exon cluster may consist of overlapping predictions for the C.elegans ver-1 gene (Fig. 1). In practice each exon cluster usually contains 1-3 genes. Exon clusters are identified along each contig or chromosome in each species. We will refer to the exons in a query species exon cluster X as x1, x2 . . . xn, and the exons in an exon cluster Y in a related species as y1, y2 . . . ym.

Fig. 1
The predictions for the C.elegans ver-1 gene from different gene-finders form one ‘exon cluster’. The gene model shown at the top has been experimentally confirmed. Genomix aims to select the subset of predicted exons that are most likely ...

Gene-finders sometimes predict the incorrect reading frame for an exon. Thus, all of the exons in each exon cluster are translated in each of their three possible reading frames. Any exon translations that contain internal stop codons are discarded.

2.4 Finding matching exon clusters

For each exon cluster X in the query species, we identify matching exon clusters within and between species. In other words, we identify exon clusters corresponding to genes that are paralogs or orthologs of the gene(s) in the query exon cluster. To do this, we run blastp (Altschul et al., 1997) to compare the exon translations from the query species (e.g. C.elegans) to a database of all translated exons from the query species and related species (e.g. C.briggsae and C.remanei). blastp is run using an E-value cut-off of 0.1.

In general, the exons in a query exon cluster have blastp hits to exons from several different exon clusters from the query species and related species. A score is assigned to the match between a query exon cluster and each of its matching exon clusters, by summing the bit scores for the highest scoring blastp hits (hsps) between their exons. For each query exon cluster, the top-scoring matching cluster is identified, and the others are discarded. The top-scoring matching cluster could correspond to a gene that is either a paralog or an ortholog of the gene in the query exon cluster.

2.5 Exon sequence conservation

We can rapidly identify matching exon clusters by using blastp to search for matching exons. The blastp bit score gives a rough measure of the sequence conservation between a predicted exon xi in a query exon cluster X and an exon yj in the top matching exon cluster Y. However, blastp does not give a very accurate measure of sequence conservation, especially for short or poorly conserved exons. Furthermore, as Zhang et al. (2003) pointed out, blast does not identify the limits of the region of similarity between two exons very well.

To obtain a more accurate measure of the sequence conservation between each pair of exons xi and yj, we use prss (version 3.4t25; Pearson, 1996) to calculate a P-value Pij for the significance of the Smith-Waterman alignment between their amino acid sequences (Fig. 2). We use the blosum50 scoring matrix and 200 shuffles in prss, and since indels are relatively rare within coding exons, high gap-open and gap-extension penalties (−15 and −10) are used. The prss P-value is transformed into a sequence similarity score S1ij for exons xi and yj using the equation:

S1ij=log10(Pij)b1

where b1 is a constant that determines the threshold used for the prss P-value. For example, a value of 2 for b1 corresponds to a prss P-value threshold of 10−2. Our method uses parameters b1-b10 and we discuss how the values for these were set below.

Fig. 2
Measuring exon conservation. (A) Four different gene-finders predict different coordinates for the fifth exon of C.elegans gene X. (B) Genomix calculates a score for each of the four predicted C.elegans exons, which reflects how conserved is its sequence, ...

2.6 Intron–exon boundary conservation

If the sequence similarity score for exons xi and yj is significant (S1ij > 0), we also calculate an ‘exon-boundary similarity score’ (S2ij), which reflects how well the exon boundaries of exons xi and yj are conserved:

S2ij=Itb2+(Is+Ie)b3+30((Os+Oe)b4)

where:

  1. It is the total number of positions in the prss alignment of exons xi and yj that are identical and are flanked by two identical positions and b2 is a constant;
  2. Is and Ie are the numbers of identities in the first ten and last ten prss alignment positions, and b3 is a constant;
  3. Os and Oe are the number of overhanging residues at the start and end of the prss alignment and b4 is a constant. We only penalize up to 15 overhanging residues at the start or end: if Os > 15 then we set Os = 15, and if Oe > 15 we set Oe = 15.

If exons xi and yj have high sequence identity, and if there are no overhanging residues at the start (or end) of the prss alignment, then the intron–exon boundary at the start (or end) of exons xi and yj is inferred to be highly conserved.

2.7 Exon phases and length conservation

If exons xi and yj have significant sequence similarity (S1ij > 0), we calculate a ‘phase similarity score’ S3ij that reflects to what extent their phases are conserved. We will use the term ‘exon start phase’ to refer to the phase of an exon’s 5′-flanking intron, and the term ‘exon end phase’ to refer to the phase of the exon’s 3′-flanking intron. S3ij is initialized to zero, and is increased by a constant b5 if the start phases of exons xi and yj are the same and/or by b5 if their end phases are the same.

Even if an exon’s sequence is not conserved, its length may still be conserved. Thus, if the sequence similarity score for exons xi and yj is not significant (S1ij ≤ 0), a score S4ij is calculated that reflects whether their lengths are conserved. If the difference between their lengths is ≤3 bp, we set S4ij equal to a constant b6 (otherwise S4ij = 0).

2.8 Total conservation score

We calculate a total conservation score Sij for exons xi and yj as:

Sij=S1ij+S2ij+S3ij+S4ij

The matrix of scores S describes the similarities between each exon in a query exon cluster and each exon in a matching exon cluster. Matrix S is used as a scoring matrix in the dynamic programming step of Genomix, as described below.

2.9 Dynamic programming

For each query exon cluster, dynamic programming is used to search for the set of predicted query exons whose added conservation scores are maximized under their reading frame constraints (Fig. 3). The exons in the query exon cluster and in its top matching exon cluster are first sorted from 5′ to 3′. We then proceed to fill the dynamic programming matrix D from top left to bottom right, by calculating Dij as:

Dij=Sij+max(Dkl+B7+B8+B9+B10)

where:

  1. ki and lj, but !(k = i, l = j).
  2. exon xk must be in the same reading frame as exon xi, and exon yl must in the same reading frame as exon yj
  3. if k < i, then query exon xk must end >30 bp before query exon xi begins in the chromosomal DNA. This prevents us from predicting introns of <30 bp, which would probably be shorter than the minimum length for an intron to be spliced correctly (Deutsch and Long, 1999). Likewise, if l < j, then exon yl must end >30 bp before exon yj begins in the chromosomal DNA.
  4. if (k = i, l! = j), then the prss alignment between the amino acid sequences of exons xi and yl, and the prss alignment between exons xi and yj, must not overlap with respect to the amino acid sequence of exon xi.
  5. B7 is a score designed to favour exons that were predicted by the same input gene-finder. If k! = i and if exons xk and xi were predicted as adjacent exons in a gene by at least one of the input gene-finders, then B7 = b7, where b7 is a constant. B7 = 0 otherwise.
  6. B8 is a score designed to favour predicted conserved initial exons. If exons xk and yl have significant sequence conservation (S1kl > 0), and if they are both possible initial exons (start in phase 0 with Met), then B8 = b8, where b8 is a constant. Otherwise B8 = 0.
  7. B9 is a score designed to favour predicted conserved terminal exons. If exons xi and yj have significant sequence conservation (S1ij > 0), and if they are both terminal exons (end in phase 0 with a stop codon), then B9 = b9, where b9 is a constant. B9 = 0 otherwise.
  8. B10 is a penalty designed to disfavour a very long predicted intron between exons xk and xi that is much longer than the median C.elegans intron length (65 bp). B10 = 0 if the intron length is <250 bp; and B10 = 0.75*b10 if the intron is 250-500 bp, where b10 is a constant. For introns larger than 500 bp, B10 increases in steps of b10 every 250 bp, from B10 = b10 for introns of 500-750 bp to B10 = 5*b10 for introns of ≥1500 bp.
Fig. 3
Using dynamic programming to select predicted exons. (A) To select the best subset of exons predicted in the C.elegans query exon cluster (here exon cluster 18196, which corresponds to the C.elegans ver-1 locus), dynamic programming is used to find the ...

After calculating matrix D, the optimal alignment between the exons in the query exon cluster and the exons in the top matching exon cluster is found by tracing back through matrix D. We start at the cell (i,j) that has the max(Dij). This is the best score for an alignment between predicted exons x1 . . . xi from the query exon cluster and predicted exons y1 . . . yj from the top matching exon cluster. We trace back through the matrix until a cell for which Dij ≤ 0 is reached. The exons selected by the dynamic programming algorithm form Genomix’s final gene prediction(s) for the query exon cluster. Note that this process can, and frequently does, generate multiple genes and can also generate partial genes, although these are rarer. To make gene predictions for a whole genome, for example, for C.elegans, dynamic programming is used to select the best predicted exons in each query exon cluster, by comparison to the top matching exon cluster, for example from C.briggsae, C.remanei or C.elegans.

Dynamic programming can also be used to select the best predicted exons in a query exon cluster by comparison to two, three or more top matching exon clusters. For example, we can select the best conserved predicted exons in the C.elegans ver-1 locus by comparison to the predicted exons in the C.remanei ver-1 locus using 2D dynamic programming (Fig. 3), or by comparison to the predicted exons in both the C.remanei ver-1 and C.briggsae ver-1 loci by using 3D dynamic programming.

In a post-processing step, Genomix discards gene predictions that lack any high-scoring exon (having exon score S ≥ 1). This is designed to discard gene predictions that correspond to (pseudogenes or gene fragments.

2.10 Output gene set

The output gene set is in GFF format. A score S is given to each exon predicted by Genomix: this is the maximum conservation score Sij observed for exon xi when compared to all other exons yj in the top matching exon cluster.

2.11 Optimizing Genomix parameters

The parameters b1-b10 were hand-tuned by starting with prior estimates and iteratively adjusting these parameters to improve Genomix’s prediction accuracy on the training set of 381 genes (described below). Using this manual tuning strategy, we chose parameter values of b1 = 1, b2 = 3, b3 = 0.3, b4 = 1.5, b5 = 5, b6 = 2, b7 = 1, b8 = 3, b9 = 10, and b10 = −0.2.

2.12 Training set and test set

A total of 4025 C.elegans genes that each have one coding transcript that has been confirmed by mRNA or EST alignment (and are not known to be alternatively spliced) were downloaded from WormBase (version WS147; Schwarz et al., 2006). Out of these genes, 381 were randomly chosen as a training set for tuning parameters during development of our algorithm. Excluding the training set genes, 1534 of the remaining genes were randomly selected as a test set for assessing the accuracy of Genomix. Note that not all of the 4025 confirmed genes were used for the training and test sets; some were kept back in case an extra test set was needed. To measure Genomix’s specificity in intergenic regions, we randomly selected 1179 intergenic regions from the ~20 000 intergenic regions in the whole genome. Any predicted exons in gene predictions that lie completely in these intergenic regions were counted as false positives. The test set genes and intergenic regions span ~6% of genic DNA and ~6% of intergenic DNA in C.elegans.

2.13 Input gene sets

We analysed C.elegans gene predictions for the WormBase WS147 release of the genome (Schwarz et al., 2006). twinscan predictions (Korf et al., 2001) were downloaded from http://mblab.wustl.edu. Several other gene-finders were also run on the C.elegans genome: Genefinder (release 980504; P. Green, unpublished data), fgenesh (Salamov and Solovyev, 2000), and snap (version 2005-10-24; Korf, 2004). For fgenesh the nematode-specific trans-splicing (-n) option was used.

We analysed gene predictions for the cb25.agp8 C.briggsae genome assembly (Stein et al., 2003). C.briggsae gene sets made using fgenesh and Genefinder during the C.briggsae genome project (Stein et al., 2003) were downloaded from WormBase. twinscan predictions were downloaded from http://mblab.wustl.edu.

Gene predictions based on version 041227 of the C.remanei genome were analysed (GenBank accession AAGD01000000). Gene sets that were made as part of the C.remanei genome project using fgenesh, Genefinder and snap were downloaded from WormBase. twinscan predictions were downloaded from http://mblab.wustl.edu.

2.14 Jigsaw predictions

Version 3.2.5 of jigsaw (Allen et al., 2006) was downloaded from http://www.cbcb.umd.edu/. We made two different gene sets using jigsaw, each with different input data:

  1. fgenesh, Genefinder, snap and twinscan predictions for C.elegans;
  2. The four gene sets, plus blat alignments of C.elegans mRNAs to the C.elegans genome, and predicted splice sites.

We will refer to these gene sets as jigsaw_1 and jigsaw_2. The positions of the best blat alignments of C.elegans mRNAs to the genome were downloaded from WormBase (release WS147). To predict the positions of splice sites in the C.elegans genome, the Genefeatures program by R. Durbin was used. Genefeatures is based on part of the Genefinder software (P. Green, unpublished data) and is available as part of the AceDB database software (Durbin and Thierry-Mieg, 1994 and http://www.acedb.org). Predicted splice sites that were assigned confidence scores of ≥3.0 by Genefeatures were used as input for jigsaw. For each of the two jigsaw gene sets, jigsaw was trained using the ‘oblique splits’ option on our training set of 381 fully confirmed C.elegans genes and then was run on the whole C.elegans genome.

3 RESULTS

The accuracy of Genomix was assessed on a test set of 1534 randomly chosen C.elegans genes, and 1179 randomly chosen intergenic regions (see Methods Section). These cover 6% of the total genic DNA and 6% of the intergenic DNA in the C.elegans genome. Since we do not expect any genes in the intergenic regions, predictions within these regions are counted as false positives.

3.1 Combining four gene-finders using Genomix

Genomix was used to combine predictions from fgenesh (Salamov and Solovyev, 2000), Genefinder (P. Green, unpublished data), snap (Korf, 2004) and twinscan (Korf et al., 2001) for C.elegans. Conservation with C.elegans paralogs and homologs from C.briggsae and C.remanei, was used to identify the most conserved C.elegans predicted exons. That is, each C.elegans exon cluster was compared to its top matching exon cluster from either C.elegans, C.briggsae or C.remanei.

Gene prediction accuracy was measured in terms of sensitivity and specificity at the nucleotide, exon and gene level (Burset and Guigó, 1996; Table 1). The exon-level sensitivity is the fraction of real exons predicted correctly by a gene prediction program. For an exon prediction to be considered correct, both the 5′ and 3′ boundaries must match the true exon exactly. The exon level sensitivities of the input gene-finders were 86.4% for fgenesh, 86.7% for Genefinder, 84.2% for snap and 88.8% for twinscan (Table 1). Only 96.4% of the real exons in the test set of 1534 C.elegans genes were predicted correctly by at least one of fgenesh, Genefinder, snap or twinscan. As for other combiners that focus on exons as the indivisible elements to be combined, it is impossible for Genomix to predict an exon that was missing from all the input predictions. As a result, Genomix’s maximum possible sensitivity is 96.4%. Genomix has 91.5% sensitivity at the exon level, i.e. it has 2.7% higher sensitivity than the most sensitive of the four input gene-finders. Furthermore, Genomix outperforms the individual gene-finders in terms of gene-level sensitivity by 6.4% (increasing gene-level sensitivity from ≤62.8 to 69.2%). Here a test set gene is considered to have been predicted correctly if the prediction made for the gene contains all of its real exons and no false-positive exons.

Table 1
Comparison of the accuracy of the input gene sets to that of the combined gene sets made by Genomix and jigsaw, for a test set of 1534 confirmed C.elegans genes and 1179 intergenic regions

The exon-level specificity is the fraction of the predicted exons that are real. Taking into account false-positive whole-gene predictions completely contained in the 1179 test set intergenic regions, the exon level specificities for the input gene-finders were 72.3% for fgenesh, 72.7% for Genefinder, 73.3% for snap and 77.2% for twinscan (Table 1). Genomix has 87.3% exon-level specificity, which is 10.1% higher than that of twinscan, the input gene-finder with the highest specificity. In addition, in terms of gene-level specificity Genomix performs better than the best of the input gene-finders, twinscan, by 9.7%.

3.2 Comparison to jigsaw

The accuracy of Genomix was compared to that of jigsaw (Allen et al., 2006), a combiner software that performed as well or better than any of the other entries in the egasp competition (Guigó et al., 2006). For egasp, Allen et al. (2006) used jigsaw to combine six different gene-finders, mRNA and EST alignments, curated genes and predicted conserved elements.

To see how well jigsaw performs using just raw gene predictions as input, a jigsaw gene set was made by combining the fgenesh, Genefinder, snap and twinscan predictions (‘jigsaw_1’). Jigsaw had 92% exon-level sensitivity and 82% exon-level specificity. This is roughly the same as the exon-level sensitivity of Genomix (92%), but slightly less than Genomix’s exon-level specificity (87%; Table 1).

However, while jigsaw’s input can be restricted to the outputs of several gene-finders, it performs better if its input also includes other data sources, especially alignments of mRNAs (Allen et al., 2006). Thus, to make a fair comparison of jigsaw’s and Genomix’s accuracies when supplied with their optimal input data, a second jigsaw gene set was made by combining the four input gene-finders, plus predicted splice sites and mRNA alignments. At the exon level, this second jigsaw gene set (‘jigsaw_2’) had 92.2% sensitivity and 83.8% specificity. Compared to Genomix, the jigsaw_2 gene set has 0.7% higher exon-level sensitivity (Fisher’s test: P = 0.1; McNemar’s test: P = 0.02), but 3.5% lower exon-level specificity (Fisher’s test: P < 10−9).

3.3 Suggested changes to WormBase curated genes

Genomix sometimes assigns a higher score to a predicted exon than to an overlapping confirmed exon in the test set. We surmised that some of these high-scoring predicted exons could be extensions to existing confirmed exons, or alternative transcripts. To investigate this, we examined exons to which Genomix assigns a score that is >100 higher than the score that it assigns to the overlapping test set exon. For example, Genomix assigns a score of 215 to the confirmed initial exon of T20D3.5, but assigns a score of 401 to an overlapping exon predicted by fgenesh and twinscan. In its final prediction for that locus, Genomix selected the fgenesh/twinscan exon and also selected an additional upstream exon. We realized that, compared to the confirmed transcript from WormBase WS147, the fgenesh/twinscan exon and extra initial exon align further upstream to the orthologs in human and fly. This information was sent to WormBase, who curated our suggested gene structure as alternative isoform T20D3.5b and renamed the original confirmed transcript as T20D3.5a (Fig. 4).

Fig. 4
An alternative isoform of a WormBase curated gene that was suggested by Genomix. (A) WormBase release WS147 contained one confirmed transcript for gene T20D3.5, now known as T20D3.5a. Genomix suggested an alternative isoform T20D3.5b. The grey box indicates ...

We also examined exons predicted by Genomix that did not overlap any curated WormBase WS147 exon, but that had high Genomix conservation scores of >100. There were 4373 such conserved exons, belonging to 2418 gene predictions that do not overlap any curated WormBase gene or annotated pseudogene. Of these 2418 putative new genes, 106 have ≥3 exons with conservation scores of >100, have valid start and stop codons, and contain coding DNA that is <10% repetitive. Two examples that we examined had already been added to WormBase since release WS147 (independently of our predictions): Y46E12A.4 and T07C4.11. A third example was a novel C.elegans gene prediction that does not have similarity to any curated C.elegans gene but shows high sequence conservation (61% identity) with the C.briggsae and C.remanei orthologs predicted by Genomix. We informed the WormBase curators about this putative gene, and it has been accepted into the next WormBase release (WS169) as R01H2.8. We are compiling a list of the most convincing new genes and exons suggested by Genomix to give to WormBase.

4 DISCUSSION

Predicting genes in eukaryotic DNA is still a challenge for the best gene-finders. We have presented a method for combining predictions from two or more gene-finders, which successfully improves prediction accuracy. Genomix should prove useful in providing accurate gene predictions for the increasing number of genomes that are being sequenced, and especially where transcript data is not available to aid gene prediction, for example in the many groups of related invertebrate species currently being sequenced. These include arthropods (12 Drosophila species), platyhelminths (Schistosoma mansoni and S.japonicum), cnidarians (the sea anemone Nematostella and coral Acropora), mollusks (bivalves Spisula and Mytilus and gastropods Lottia, Aplysia and Biomphalaria), and annelids (Helobdella and Capitella) (Liolios et al., 2006).

In a large test set of C.elegans genes, Genomix improves the exon-level sensitivity by 2.7% and the exon-level specificity by 10.1% compared with the input gene-finders. Genomix’s high exon-level specificity is due to its approach of scoring predicted exons according to their conservation, which allows it to discard many false positive non-conserved exons that were predicted by the input gene-finders. Genomix also correctly predicts some exons that the most accurate gene-finder misses. At the exon level twinscan has 2.1% higher sensitivity and 3.9% higher specificity than fgenesh, Genefinder or snap, but Genomix can still improve on twinscan’s accuracy by combining it with the other three gene-finders (Table 1). In fact, despite the fact that Genomix does not make use of EST or mRNA data, its exon-level accuracy is close to the 92% sensitivity and 87% specificity achieved by twinscan_est (on a different data set), which uses EST data (Wei and Brent, 2006).

The recent combiner glean (Elsik et al., 2007) is the first combiner for gene-finders that does not require any training set. This is a huge advantage for annotating newly sequenced genomes that lack EST or cDNA resources. In contrast, in order for Genomix to achieve its best performance on a new genome, its parameters b1, b2 . . . b10 should be re-tuned for the new genome. However, we found that Genomix performs quite well on an unseen genome even if parameters b1, b2 . . . b10 are not re-tuned. We used the version of Genomix that was tuned for C.elegans to make gene predictions for D.melanogaster by combining the output of three different gene-finders, and found that Genomix increased exon-level sensitivity by 2.3% and exon-level specificity by 3.5% over the best input gene-finder (Supplementary Table 2). That is, while it is advisable to re-tune Genomix parameters, in the absence of any training data it is safe to assume that Genomix will produce reasonably good predictions without re-tuning. This is probably due to the fact that Genomix tries to avoid the need for a large training set by using prss (Pearson, 1996) to directly estimate the probability that a predicted exon overlaps a real coding exon. That is, we use prss to calculate the P-value for the significance of the alignment between the amino acid sequence of a predicted exon and a homologous exon from the same or a related species. The P-values calculated by prss for amino acid sequences are very accurate (Brenner et al., 1998; Pearson, 2000), and so give us a reliable estimate of the probability that a predicted exon has conserved amino acid sequence. As a result, predicted exons and genes that are assigned non-significant P-values by prss are often wrong (do not overlap any real exon). This probably underlies Genomix’s good performance with respect to completely wrong exon predictions (WE = 8.1%) and completely wrong gene predictions (WG = 2.7%; Table 1).

The fact that Genomix assigns a conservation score to each predicted exon separately makes it possible to use input gene sets that are correlated without biasing the results. This is because an exon predicted in correlated gene sets will only be assigned a high score if it is highly conserved. As a result, it is possible, for example, to combine predictions from a single gene-finder using two different parameter settings.

Genomix only works for genes that have sequenced homologs in the same species or in a closely related species. However, this does not affect sensitivity too badly. While Genomix does completely miss some whole genes that are not conserved, these are only ~3.2% of test set genes (Genomix’s total MG = 3.5%, of which 0.3% is due to genes completely missed by the four input gene-finders). Likewise, Genomix completely misses some non-conserved exons in otherwise conserved genes, but this only occurs for ~1% of test set exons.

Many genes in animal genomes are thought to undergo alternative splicing (Kan et al., 2001). Ten percent of C.elegans genes are annotated as having alternative splicing in the coding region, and this is probably an underestimate. Predicting alternative splicing is a very important challenge for the gene-finding field (Guigó et al., 2006). So far, only a few combiner programs such as EuGène (Foissac and Schiex, 2005) predict alternative transcripts. In its current implementation, Genomix predicts only a single isoform for each gene. The test set analysed here consists of genes that each has a single isoform. However, for genes that do undergo alternative splicing, Genomix will miss many real alternative isoforms. An important future direction is to develop Genomix so that it can predict alternative isoforms, for example by making use of mRNA/transcript data.

Supplementary Material

Supplementary Data

ACKNOWLEDGEMENTS

This project was supported by The Wellcome Trust and an EMBO long-term fellowship to A.C. We thank John Spieth (Washington University GSC) for kindly allowing us to use C.remanei sequence data and gene predictions, Bill Pearson for advice on running prss, Jonathan Allen for advice on running jigsaw, Venky Iyer for Drosophila gene sets, and Des Higgins (University College Dublin) for hosting A.C. in his group during part of this work. We are grateful to members of the Durbin research group and WormBase for useful discussions, and to two anonymous reviewers, as well as to Noel O’Boyle, Alan Moses, Jean-Karim Hériché, Li Heng and David Carter for helpful comments on the manuscript.

Footnotes

Conflict of Interest: none declared.

Availability: Scripts and Supplementary Material can be found at http://www.sanger.ac.uk/Software/analysis/genomix

Supplementary information: Supplementary data are available at Bioinformatics online.

REFERENCES

  • Ali KM, Pazzani MJ. Error reduction through learning multiple descriptions. Machine Learning. 1996;24:173–206.
  • Allen JE, et al. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biol. 2006;7(Suppl. 1):S9. [PMC free article] [PubMed]
  • Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Brenner SE, et al. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA. 1998;95:6073–6078. [PubMed]
  • Brent MR. Genome annotation past, present and future: how to define an ORF at each locus. Genome Res. 2005;15:1777–1786. [PubMed]
  • Burset M, Guigó R. Evaluation of gene structure prediction programs. Genomics. 1996;34:353–367. [PubMed]
  • Deutsch M, Long M. Intron-exon structures of eukaryotic model organisms. Nucleic Acids Res. 1999;27:3219–3228. [PMC free article] [PubMed]
  • Dietterich TG. Machine-learning research: four current directions. The AI Magazine. 1997;18:97–136.
  • Durbin R, Thierry-Mieg J. The ACeDB Genome Database. In: Suhai S, editor. Computational Methods in Genome Research. Plenum Press; New York: 1994. pp. 45–56.
  • Elsik CG, et al. Creating a honey bee consensus gene set. Genome Biol. 2007;8:R13. [PMC free article] [PubMed]
  • Foissac S, Schiex T. Integrating alternative splicing detection into gene prediction. BMC Bioinformatics. 2005;6:25. [PMC free article] [PubMed]
  • Guigó R, et al. EGASP: the human ENCODE genome annotation assessment project. Genome Biol. 2006;7(Suppl. 1):S2. [PMC free article] [PubMed]
  • Howe KL, et al. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 2002;12:1418–1427. [PubMed]
  • Kan Z, et al. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001;11:889–900. [PubMed]
  • Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59. [PMC free article] [PubMed]
  • Korf I, et al. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001;17(Suppl. 1):S140–S148. [PubMed]
  • Li H, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–D580. [PMC free article] [PubMed]
  • Liolios K, et al. The genomes on line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006;34:D332–D334. [PMC free article] [PubMed]
  • Murakami K, Takagi T. Gene recognition by combination of several gene-finding programs. Bioinformatics. 1998;14:665–675. [PubMed]
  • Parra G, et al. Comparative gene prediction in human and mouse. Genome Res. 2003;13:108–117. [PubMed]
  • Pavlović V, et al. A bayesian framework for combining gene predictions. Bioinformatics. 2002;18:19–27. [PubMed]
  • Pearson WR. Effective protein sequence comparison. Methods Enzymol. 1996;266:227–258. [PubMed]
  • Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 2000;132:185–219. [PubMed]
  • Rogic S, et al. Evaluation of gene-finding programs on mammalian sequences. Genome Res. 2001;11:817–832. [PubMed]
  • Salamov AA, Solovyev VV. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000;10:516–522. [PubMed]
  • Schiex T, et al. EUGENE: An eukaryotic gene finder that combines several sources of evidence. Lecture Notes in Computer Science. 2001;2066:111–125.
  • Schwarz EM, et al. WormBase: better software, richer content. Nucleic Acids Res. 2006;34:D475–D478. [PMC free article] [PubMed]
  • Shah SP, et al. Genecomber: combining outputs of gene prediction programs for improved results. Bioinformatics. 2003;19:1296–1297. [PubMed]
  • Stein LD, et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 2003;1:E45. [PMC free article] [PubMed]
  • Ureta-Vidal A, et al. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 2003;4:251–262. [PubMed]
  • Wei C, Brent MR. Using ESTs to improve the accuracy of gene prediction. BMC Bioinformatics. 2006;7:327. [PMC free article] [PubMed]
  • Yada T, et al. DIGIT: a novel gene finding program by combining gene-finders. Pac. Symp. Biocomput. 2003;8:375–387. [PubMed]
  • Zhang L, et al. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res. 2003;13:1190–1202. [PubMed]