The overall strategy (described in Materials and methods, below) in creating an optimal honey bee gene list involved comparing a variety of gene lists with a set of manually annotated genes (which had not been included in the gene prediction evidence) and determining which gene set was superior based on two metrics. Five different sets of gene predictions were used as input to GLEAN and the output represented the sixth gene set. Two evaluations were performed. The first evaluation, to determine the utility of GLEAN, was based on a comparison with a set of 395 manually annotated gene models. These gene models were created by members of the honey bee research community using the genome assembly along with EST and cDNA sequences under study in various laboratories but not yet submitted to a public database. The EST and cDNA sequences used to construct the 395 gene models were not available to the contributors of the input gene prediction sets and were purposely omitted as evidence in generating the GLEAN consensus set. These sequences were arbitrarily selected based on availability in the community, and there were no known biases in this collection of genes. The GLEAN consensus and input gene sets were compared with these manual annotations using two metrics: the number of genes showing identical matches and the number of genes showing any match of 95% identity or greater.
Although the manually annotated gene models used in the first evaluation were high quality because of their cDNA origin, they did not allow computation of sensitivity and specificity, because they were located randomly throughout the genome. A second evaluation, using expert annotated gene models from entire scaffolds, was used to compare the sensitivity and specificity of GLEAN with those of the input gene sets. This second set of manually annotated gene models relied on protein homology and gene prediction evidence as well as cDNA evidence. Finally, the gene prediction sets were compared with spliced EST alignments to determine congruency in donor/acceptor sites.
Initial evaluations (Table ) suggested that the GLEAN consensus set was superior to the individual gene sets. The merged GLEAN gene set had fewer gene models than most of the sets, yet it had the greatest number of perfect alignments and the highest fraction of perfectly aligned gene models. The GLEAN set had the second greatest number of genes showing any match (surpassed only by the Fgenesh set, which had three times as many gene models as GLEAN) and the greatest fraction of genes showing a match (equaling the NCBI gene list for this statistic). Thus, by these two tests the GLEAN gene set was judged to be the optimal one, with an increased number of known genes. Further evaluations described below showed that, in terms of quality, GLEAN was equal to or superior to the best gene prediction set.
Characteristics of gene sets
General characteristics of the gene sets are shown in Table . GLEAN was most similar to the NCBI set in terms of gene length and transcript length. The number of single exon genes in the GLEAN set (705) was more similar to the number in the Fgenesh set (882) than to the NCBI set (194). Table illustrates a challenge encountered by many gene prediction algorithms in predicting start and stop sites. GLEAN performed among the best in the proportion of complete transcripts, and only 13 of the 10,157 GLEAN gene models lacked stop codons.
General Statistics for GLEAN and input gene prediction sets.
Contributions of individual sets to the consensus set
The representation of each gene set in the consensus set is shown in Table , using different criteria to identify overlapping gene models. From most relaxed to most stringent, the criteria are 80% overlap on at least one sequence, 80% overlap on both sequences, and exact match. Table shows that NCBI and Fgenesh contributed to the greatest number of GLEAN gene models and exons. A more important issue might be the number of GLEAN gene models that are represented by only one set. These are the genes that would not be represented in nonconsensus sets. Table shows the number of GLEAN gene models and exons represented by only one set, using the previously mentioned overlap criteria. A notable point is that a number of transcripts and exons were contributed by Fgenesh, the ab initio program. This illustrates a benefit of GLEAN, in that it can exploit the high sensitivity of a dataset that has low specificity.
Number (%) of GLEAN transcripts and exons with overlap to gene prediction sets
Number (%) of GLEAN transcripts and exons with overlap to only one gene prediction set
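The three overlap criteria described above can be sketched in a few lines. This is a minimal illustration only, not the procedure used in the study: it treats each gene model or exon as a single half-open interval on one scaffold, and the names `overlap_length` and `classify_overlap` are hypothetical.

```python
from typing import Tuple

# Assumption: a feature is a half-open interval [start, end) on one scaffold.
Interval = Tuple[int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by the two intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_overlap(a: Interval, b: Interval, threshold: float = 0.8) -> str:
    """Return the most stringent criterion that the pair satisfies.

    "exact": identical coordinates
    "both":  shared span covers >= threshold of both features
    "one":   shared span covers >= threshold of at least one feature
    "none":  none of the criteria is met
    """
    if a == b:
        return "exact"
    shared = overlap_length(a, b)
    frac_a = shared / (a[1] - a[0])
    frac_b = shared / (b[1] - b[0])
    if frac_a >= threshold and frac_b >= threshold:
        return "both"
    if frac_a >= threshold or frac_b >= threshold:
        return "one"
    return "none"
```

The criteria are nested: any pair that is an exact match also meets both 80% criteria, so counts under the relaxed criteria are always at least as large as counts under the stringent ones.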
Sensitivity and specificity are shown in Tables and for different levels of comparison. Sensitivity and specificity were evaluated based on exact match at the gene level, transcript level, exon level, and nucleotide level. The evaluation using chromosome 15/16 manual annotations (Table ) suggested that GLEAN was superior to all of the gene sets in all measures.
Sensitivity and specificity using 684 manual gene models from chromosomes 15 and 16
Sensitivity and specificity using 33 manual gene models from scaffold 1.16
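As an illustration of exact-match evaluation at the exon level, the sketch below assumes the convention common in gene-prediction benchmarking, in which sensitivity is the fraction of annotated features that are predicted exactly and specificity is the fraction of predicted features that are exactly correct. The source does not restate its definitions here, so this convention, the `Exon` tuple layout, and the function name are assumptions.

```python
from typing import Set, Tuple

# Assumption: an exon is identified by (scaffold, start, end, strand).
Exon = Tuple[str, int, int, str]


def exact_match_sn_sp(predicted: Set[Exon], annotated: Set[Exon]) -> Tuple[float, float]:
    """Exact-match sensitivity and specificity for one prediction set.

    sensitivity = exact matches / annotated features
    specificity = exact matches / predicted features
    """
    true_pos = len(predicted & annotated)
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data (hypothetical coordinates): 4 annotated exons, 5 predicted, 3 shared.
annotated = {("s1", 100, 200, "+"), ("s1", 300, 400, "+"),
             ("s1", 500, 600, "+"), ("s1", 700, 800, "+")}
predicted = {("s1", 100, 200, "+"), ("s1", 300, 400, "+"),
             ("s1", 500, 600, "+"), ("s1", 710, 800, "+"),
             ("s1", 900, 950, "+")}
sn, sp = exact_match_sn_sp(predicted, annotated)
```

The same function applies at the gene or transcript level if each feature is encoded as a hashable description of its full structure, because exact match reduces to set intersection.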
We were wary of potential observer bias because the GLEAN set was visible to the annotator when the chromosome 15/16 set was annotated. Although instructed to ignore the GLEAN models, the annotator was still able to see the GLEAN models in the chromosome 15/16 annotation, and thus might annotate genes more 'favorably' for GLEAN. To check for observer bias, the annotator created gene models on an additional scaffold without viewing the GLEAN set (Table ). If observer bias was truly present, then we would expect GLEAN to perform poorly compared with other predictors in the scaffold evaluation.
Several of the gene sets, including GLEAN, performed more poorly on the scaffold than in the chromosome 15/16 evaluation. A possible explanation is that the scaffold performance estimates were based on a smaller number of genes, and so they would have wider confidence intervals (be less precise) than the chromosome 15/16 estimates. Nevertheless, GLEAN performed as well as or better than the other predictors in the scaffold evaluation. Furthermore, performance decreased in the scaffold evaluation not only for GLEAN but also for the other predictors. Had the other predictors fared better than GLEAN on the scaffold, observer bias would have been the likely explanation; because they did not, GLEAN's superior performance on chromosome 15/16 is unlikely to have been due to observer bias.
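The effect of sample size on the precision of these estimates can be illustrated with a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method reported in the study. The sample sizes (33 and 684 gene models) come from the two evaluations; the success counts are hypothetical and chosen only so that both proportions are close to one half.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical success counts; only the totals (33 and 684) are from the text.
lo_small, hi_small = wilson_interval(17, 33)
lo_large, hi_large = wilson_interval(342, 684)
width_small = hi_small - lo_small
width_large = hi_large - lo_large
```

With these illustrative counts the interval for the 33-gene scaffold is several times wider than the interval for the 684-gene set, so a moderate drop in estimated performance on the scaffold is compatible with sampling variation alone.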
Among the prediction sets, GLEAN was most congruent with aligned ESTs (Table ). GLEAN had the greatest number of donor/acceptor splice matches to internal EST donor/acceptor sites (perfect introns), and performed among the best in the proportions of perfect donor/acceptor matches to the number of internal EST donor/acceptor sites and the total number of predicted donor/acceptor sites.
Comparison of gene prediction sets with spliced EST alignments
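The three splice-site congruence measures described above can be computed from two sets of introns. This is a minimal sketch, assuming that each intron is identified by its scaffold, donor coordinate, acceptor coordinate, and strand, and that a "perfect intron" is one whose donor and acceptor both match an EST intron exactly. The tuple layout and the function name are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: an intron is identified by (scaffold, donor, acceptor, strand).
Intron = Tuple[str, int, int, str]


def intron_congruence(predicted: Set[Intron], est: Set[Intron]) -> Dict[str, float]:
    """Count perfect introns and express them as two proportions."""
    perfect = len(predicted & est)
    return {
        "perfect_introns": perfect,
        # Share of EST-supported introns recovered by the prediction set.
        "fraction_of_est_introns": perfect / len(est) if est else 0.0,
        # Share of predicted introns confirmed by an EST alignment.
        "fraction_of_predicted_introns": perfect / len(predicted) if predicted else 0.0,
    }


# Toy data (hypothetical coordinates): 4 EST introns, 3 predicted, 2 shared.
est_introns = {("s1", 200, 300, "+"), ("s1", 400, 500, "+"),
               ("s1", 600, 700, "+"), ("s2", 50, 150, "-")}
predicted_introns = {("s1", 200, 300, "+"), ("s1", 400, 500, "+"),
                     ("s1", 600, 710, "+")}
stats = intron_congruence(predicted_introns, est_introns)
```

Reporting both proportions matters because a prediction set with many gene models can accumulate a high count of perfect introns while confirming only a small share of its own predictions.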
The number of genes in honey bee
The honey bee consensus set represented a larger number of genes than were present in the NCBI set, which performed the best of all of the input sets in terms of sensitivity and specificity. However, the difference in gene number was not drastic. The consensus gene set was still heavily biased to the AT-rich regions of the honey bee genome [1]. It is reasonable to think that the combined input gene prediction programs do not represent all of the genes in the honey bee genome, and therefore the consensus set could not represent all of the genes. However, manual inspection of gene families represented in the consensus set and a tiling array experiment suggest that most genes are represented [1]. While very large genes with exons located on different scaffolds would not be predicted as complete genes, their exons would be identified as separate genes in the consensus set. Thirteen genes that crossed scaffolds were identified among 2502 manually annotated genes [2