2.1 Reverse scoring
The IMM scoring algorithm in Glimmer computes the log-likelihood that a given interval on a DNA sequence was generated by a model of coding versus noncoding DNA. This model represents the probability of a nucleotide given a subset of positions in a window (called the context) adjacent to the nucleotide; for details, see (Delcher et al., 1999; Salzberg et al., 1998). Glimmer 3.0 takes advantage of the flexibility of this algorithm by scoring all open reading frames (ORFs) in reverse, from the stop codon back toward the start codon: the probability of each base is conditioned on a context window on its 3′ side, and the score of the ORF is the sum of the log-likelihoods of the bases it contains. The score is computed incrementally as a cumulative sum at every codon position in a given ORF. In many cases these scores show a marked peak, and this peak typically occurs near the correct start site (see Fig. 1 for an example). The advantage of scanning ORFs in reverse is that for nucleotides near the start site, the context window of the IMM is contained within the coding portion of the gene, the type of data on which it was trained. This yields a more precise cumulative score at nucleotides very close to the start site than a context window on the opposite side, which would intersect a non-coding region.
Fig. 1 Scoring an open reading frame from the stop codon backwards. The stop codon is at position 0 on the X-axis and the cumulative log-odds score is plotted as the solid line. Positions of possible start codons are indicated by vertical dashed lines.
As the figure shows, the cumulative IMM score steadily increases as we move away from the stop codon at the left until it reaches a peak and then begins to decline. The decrease occurs because the bases upstream of the gene start codon are non-coding and produce negative IMM log-odds scores. The figure also shows, as vertical dashed lines, the positions of all possible start codons in the ORF. We hypothesized that if we used the highest-scoring start codon in these plots, then Glimmer would find a higher percentage of true start sites. This is borne out in our experiments, described subsequently.
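The reverse-scoring idea can be sketched as follows. The per-base log-odds values and candidate start offsets below are invented for illustration; in Glimmer3 they would come from the trained IMM:

```python
# Sketch of reverse scoring: accumulate per-base log-odds from the stop
# codon backwards and pick the candidate start codon at the score peak.
# The scores and start offsets here are invented toy values.

def best_start(base_scores, start_offsets):
    """base_scores[i]: log-odds of the base i positions upstream of the
    stop codon; start_offsets: candidate start-codon offsets, measured
    the same way."""
    cumulative, total = [], 0.0
    for s in base_scores:
        total += s
        cumulative.append(total)
    # Choose the candidate whose cumulative score is highest (the peak).
    return max(start_offsets, key=lambda p: cumulative[p])

scores = [0.4, 0.3, 0.5, 0.2, 0.1, 0.3, -0.2, -0.4, -0.5]
print(best_start(scores, [2, 5, 8]))  # -> 5: the peak before scores decline
```

Note how the trailing negative scores, corresponding to non-coding sequence upstream of the true start, pull the cumulative sum down past the peak.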
An important difference between this algorithm and earlier versions of Glimmer is that, unlike those versions, which had a strong bias in favor of longer ORFs, this algorithm chooses start sites based on their relative scores.
2.2 Ribosome binding sites
In previous versions of Glimmer, the ribosome binding site (RBS) was essentially ignored, even though it provides a strong signal for the position of the true start site. We addressed this problem with a standalone program, RBSfinder, that can be run as a post-processor on the results of Glimmer's analysis. RBSfinder is quite effective at finding ribosome binding sites and adjusting Glimmer's predictions (Suzek et al., 2001), but we nonetheless felt that a better design would integrate RBS evidence directly into the gene-finding algorithm. Glimmer3 now contains this long-awaited integration.
After experimenting with several alternative algorithms, we found that the ELPH software (http://cbcb.umd.edu/software/ELPH) was highly effective at identifying the likely RBS in most bacterial genomes. Input to ELPH is a specified motif length and any set of sequences, in which it identifies likely shared motifs using a Gibbs sampling algorithm. ELPH produces a position weight matrix (PWM) that Glimmer3 then uses to score any potential RBS. If a substantial set of training genes is available, the regions upstream from their starts can be given to ELPH to produce a PWM. Otherwise, Glimmer3 can bootstrap itself by first running without a PWM, generating a set of gene predictions, and then extracting regions upstream of those predictions as input to ELPH. Glimmer3 can then be re-run with the PWM to produce a more accurate set of start-site predictions. The entire process can be iterated as desired until a consistent PWM and set of gene predictions result. This strategy of using a Gibbs sampler to find RBS motifs in an iterative fashion was introduced in the GeneMarkS gene-finding system (Besemer et al., 2001).
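The PWM scoring step amounts to summing one log-odds contribution per motif position. The 4-column matrix below is invented for illustration; a real RBS PWM produced by ELPH would be longer and trained on actual upstream regions:

```python
# Score a candidate ribosome binding site against a position weight
# matrix: one log-odds contribution per position, summed.
def pwm_score(pwm, site):
    return sum(col[base] for col, base in zip(pwm, site))

# Hypothetical 4-column PWM favouring the motif "AGGA" (invented values).
pwm = [
    {'A':  1.0, 'C': -1.0, 'G': -0.5, 'T': -1.0},
    {'A': -1.0, 'C': -1.5, 'G':  1.2, 'T': -1.0},
    {'A': -0.8, 'C': -1.2, 'G':  1.1, 'T': -0.9},
    {'A':  0.9, 'C': -1.0, 'G': -0.6, 'T': -1.1},
]
print(pwm_score(pwm, "AGGA"))  # high score for the motif itself
print(pwm_score(pwm, "TTTT"))  # low score for a non-motif site
```

In practice the highest-scoring window upstream of each candidate start would be taken as that start's RBS score.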
2.3 Reduced overlapping predictions
Both Glimmer2 and Glimmer3 start by identifying open reading frames (ORFs) with sufficiently high IMM scores to be processed further. In many cases, these ORFs overlap by more than the (user-specified) maximum allowed distance, indicating that only one of them is a true gene. Glimmer2 uses a series of rules based on ORF lengths, ORF scores and the IMM score of the overlapping region to attempt to resolve these overlap cases. When the rules do not produce a clear conclusion, however, Glimmer2 outputs both ORFs with an annotation indicating the overlap. As a result, Glimmer2 can have a high false-positive rate, particularly for high-GC genomes, which have large numbers of overlapping ORFs.
In contrast, Glimmer3 begins by assigning a score to each valid start position within an ORF. This score is the sum of RBS score plus the IMM coding potential score plus a score for the start codon (determined by the relative frequency of each possible start codon in the same training set used to determine the RBS). Each possible start codon is linked to the stop codon that terminates its ORF.
A global dynamic-programming algorithm is then used to select the set of ORFs and start sites with maximum total score across the entire input sequence, subject to the constraint that no overlaps greater than a specified maximum are allowed. Specifically, the set of potential start sites and stop positions is scanned in sorted order by location on the input sequence. At each start or stop feature f, the score of the maximum-scoring set of genes up to and including f is computed as the maximum compatible prior score in any of the six reading frames plus the score of f. Because overlaps are allowed, the value for a feature f may, in fact, change as the result of a feature encountered after f. To accommodate this case, our algorithm backtracks to update scores within the maximum allowed overlap distance and adjusts the scores to avoid double counting the score of the overlap region. Because the maximum overlap distance is typically small compared to the average gene length, the additional cost is usually insignificant.
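A much-simplified version of this dynamic program, keeping only the bounded-overlap constraint (no reading frames, no rescoring of overlap regions), might look like the following sketch over toy (start, end, score) intervals:

```python
# Simplified sketch: choose a maximum-score subset of candidate genes
# such that no two chosen genes overlap by more than max_overlap bases.
# Real Glimmer3 scans start/stop features in six reading frames and
# adjusts overlap-region scores; this toy version only enforces the
# overlap bound, in the style of weighted interval scheduling.

def select_genes(candidates, max_overlap):
    """candidates: (start, end, score) tuples; returns the maximum
    total score achievable under the overlap constraint."""
    cands = sorted(candidates, key=lambda c: c[1])  # by end position
    n = len(cands)
    best = [0.0] * (n + 1)   # best[i]: optimum over the first i candidates
    for i in range(1, n + 1):
        s, e, sc = cands[i - 1]
        # Latest earlier candidate whose overlap with this one is allowed.
        prev = 0
        for j in range(i - 1, 0, -1):
            if cands[j - 1][1] - s + 1 <= max_overlap:
                prev = j
                break
        best[i] = max(best[i - 1],        # skip candidate i
                      best[prev] + sc)    # take candidate i
    return best[n]

genes = [(0, 100, 5.0), (95, 200, 4.0), (150, 300, 6.0)]
print(select_genes(genes, 10))  # -> 11.0: the first and third genes
```

With a 10 bp allowance, the first two genes (6 bp overlap) are compatible, but the second and third (51 bp overlap) are not, so the optimum takes the first and third.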
In many respects this algorithm functions like the hidden Markov models (HMMs) used in other gene-finding programs such as GeneMark.hmm (Lukashin and Borodovsky, 1998) and EasyGene (Larsen and Krogh, 2003). The principal differences are that small overlaps between genes are allowed (without resort to a complicated set of overlap states in an HMM) and that potential coding regions are pre-scored by the IMM in the stop-to-start direction, so that the scanning direction of the algorithm effectively alternates on different segments of the sequence. The result is that the final set of Glimmer3 predictions contains no overlaps greater than the specified maximum, and the total number of Glimmer3 predictions is almost always less than the corresponding number of Glimmer2 predictions.
2.4 Improved training with long-orfs
One of Glimmer's strengths has always been the ease with which any user can automatically train it on a new genome. The long-orfs program in the Glimmer system is used to create a training set of genes from a genome by selecting ORFs above a threshold length that do not overlap other ORFs above that threshold length. The threshold length is computed by the program to be the value that maximizes the number of non-overlapping ORFs produced, thus maximizing the amount of data in the training set. For most genomes this approach is quite effective, typically producing a training set containing nearly half of all genes with relatively few ORFs that are not genes. In the case of high-GC genomes (>60% GC), however, the scarcity of stop codons results in an abundance of long ORFs that are not genes. For such genomes, the version of long-orfs in Glimmer2 produces very small output sets, with many incorrect genes.
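The threshold-selection step can be sketched as follows: for each candidate length cutoff, count the ORFs at least that long that overlap no other ORF at least that long, and keep the cutoff that maximizes the count. The intervals and cutoffs below are invented toy data:

```python
# Toy sketch of long-orfs threshold selection over (start, end) intervals.

def count_nonoverlapping(orfs, cutoff):
    """Count ORFs of length >= cutoff that overlap no other ORF of
    length >= cutoff."""
    long_orfs = [(s, e) for s, e in orfs if e - s + 1 >= cutoff]
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    return sum(1 for o in long_orfs
               if not any(o != p and overlaps(o, p) for p in long_orfs))

def best_cutoff(orfs, cutoffs):
    # Keep the cutoff that yields the largest non-overlapping training set.
    return max(cutoffs, key=lambda c: count_nonoverlapping(orfs, c))

orfs = [(0, 299), (250, 899), (1000, 1599), (1700, 2000), (2100, 2450)]
print(best_cutoff(orfs, [300, 500]))  # -> 300: three clean ORFs vs. two
```

In a high-GC genome the long spurious ORFs inflate this count, which is why Glimmer3 filters candidates by amino-acid composition first, as described next.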
To overcome this problem, the long-orfs program in Glimmer3 incorporates a new routine to filter the initial set of ORFs based on amino-acid composition. Here we take advantage of the fact that genes in widely disparate bacterial genomes tend to use a common, near-universal amino-acid distribution (Luscombe et al., 2001; Pascal et al., 2005). By comparing the ORFs found by long-orfs to a universal distribution, we can eliminate many ORFs that are highly unlikely to be protein-coding genes. Specifically, we compute the amino-acid distribution of each ORF and compare it to both a positive model derived from a large sample of microbial genomes and a negative model created from alternative reading frames of those genes. We then compute the ratio of the distances from the candidate ORF's distribution to the positive and negative models, and only ORFs whose ratio is below a user-specified threshold are passed on to the length- and overlap-calculation stage. These distance calculations use the entropies of amino-acid distributions as described in (Ouyang et al., 2004).
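The filter can be sketched as below. For brevity we use a reduced 4-letter "amino acid" alphabet with invented model frequencies, and Kullback-Leibler divergence as a stand-in distance; the actual long-orfs routine uses the full alphabet and the entropy-based measure of Ouyang et al. (2004):

```python
import math
from collections import Counter

# Toy sketch of the composition filter: keep an ORF only if its residue
# distribution is much closer to the positive (real-gene) model than to
# the negative (wrong-frame) model.  All frequencies are invented.

def aa_distribution(protein, alphabet):
    counts = Counter(protein)
    n = len(protein)
    return {aa: counts.get(aa, 0) / n for aa in alphabet}

def kl(p, q, eps=1e-6):
    """Kullback-Leibler divergence, smoothed to avoid log(0)."""
    return sum(p[a] * math.log((p[a] + eps) / (q[a] + eps)) for a in p)

def passes_filter(protein, positive, negative, threshold=1.0):
    p = aa_distribution(protein, positive)
    return kl(p, positive) / kl(p, negative) < threshold

positive = {'L': 0.4, 'S': 0.3, 'A': 0.2, 'G': 0.1}   # real-gene-like
negative = {'L': 0.1, 'S': 0.2, 'A': 0.3, 'G': 0.4}   # wrong-frame-like
print(passes_filter("LLLSSAG", positive, negative))   # gene-like -> True
print(passes_filter("GGGGAAS", positive, negative))   # frame-shifted-like -> False
```

A candidate whose composition resembles the negative model produces a ratio above the threshold and is discarded before the length and overlap calculations.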
The table below compares the output of the Glimmer2 and Glimmer3 versions of long-orfs to the set of all annotated genes for a sample of 13 bacterial and archæal genomes obtained from NCBI (Wheeler et al., 2006). Note how for the high-GC (67%) Ralstonia solanacearum genome, Glimmer2's long-orfs outputs only 288 ORFs, of which a mere 55% match annotated genes. In contrast, Glimmer3's long-orfs identifies 1175 ORFs, 96% of which match annotated genes.
Glimmer3 and Glimmer2 long-orfs output comparison
2.5 Gene prediction accuracy
In order to test the effect of the improvements in Glimmer3, we compared it to several different sets of data, shown in the tables. First, we compared its predictions on a sample of complete genomes to the ‘known’ genes from those genomes. We used NCBI annotation to determine when a gene was known, by simply removing all genes annotated as ‘hypothetical’ from the set of known genes. Genes assigned a function are in most cases closely homologous to genes from other genomes, which provides independent evolutionary evidence that these genes are real. We are aware that this method has its shortcomings, but no other method yields nearly as many genes for testing. This method also does not guarantee that the start codon is correctly predicted, because homology does not need to extend for the full length of a predicted protein in order to be considered adequate evidence for assigning function.
Another important test of Glimmer 3.0 was a direct comparison with Glimmer 2.13, so we ran both algorithms on the same set of genomes. The table below shows the accuracy results of both Glimmer2 and Glimmer3 on our benchmark genome sample. Both algorithms were trained and tested on the same data set of non-hypothetical genes in an eight-way cross-validation experiment. (For each genome, the genes were divided into eight approximately equal-size subsets. One subset was held out for testing and Glimmer was trained on the remaining subsets. This was repeated eight times so that each gene was part of one test set.) As the table shows, Glimmer3 nearly always achieves equal or higher sensitivity than Glimmer2, but with far fewer additional predictions, indicating much greater specificity. Glimmer3 also agrees far more often with the annotated start codons in the benchmark genomes: in 11 of the 13 genomes, it has greater agreement than Glimmer2. For example, in Bacillus anthracis Glimmer3 predicts 726 start sites that agree with the NCBI annotation but disagree with Glimmer2. Note, too, that we cannot guarantee that the start sites are correctly annotated for any of these genomes, but without further laboratory evidence these are the best data available.

A second table compares the results of running both Glimmer2 and Glimmer3 using the output of their respective versions of the long-orfs program to produce a training set. As before, the predictions are compared to non-hypothetical genes in the annotation. In this case, Glimmer2's performance is substantially worse because of errors in its long-orfs output. Glimmer3's performance, however, is substantially the same as when it is trained with annotated data, indicating that it is likely to do well even in the absence of a pre-computed set of ‘known’ genes for training.
Glimmer3 prediction accuracy when trained on confirmed genes
Glimmer3 prediction accuracy with long-orfs training
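The eight-way cross-validation described above is the standard k-fold scheme, which can be sketched as:

```python
# Sketch of k-fold cross-validation (k = 8 in our experiments): split
# the genes into roughly equal folds, hold out each fold in turn for
# testing, and train on the remaining folds.  The gene list is toy data.

def k_fold_splits(genes, k=8):
    folds = [genes[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield train, test

genes = [f"gene{i}" for i in range(20)]
splits = list(k_fold_splits(genes))
print(len(splits))  # -> 8 train/test splits; each gene is tested exactly once
```

Each gene appears in exactly one test fold, so every prediction in the accuracy tables comes from a model that never saw that gene during training.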
A further table compares Glimmer3 predictions to those of three other gene-finders, GeneMark.hmm, EasyGene (Larsen and Krogh, 2003) and GeneMarkS (Besemer et al., 2001), and shows that Glimmer obtains comparable results. These other systems all run through web servers or by downloading precomputed predictions, and EasyGene uses homology-search results to help determine its parameters. In contrast, Glimmer runs locally, offers many options for choosing parameters and training sets, and can be run on collections of contigs from unfinished assemblies.
Glimmer3 prediction accuracy compared to other gene-finding systems
Further evidence of Glimmer3's improved accuracy is provided by comparing its predictions to the results of recent laboratory experiments to identify unannotated genes. One such experiment was conducted on the hyperthermophilic archæon Pyrococcus furiosus by Poole et al. (2005), who used microarray expression evidence and recombinant protein tests to examine 127 ORFs not in the NCBI annotation. Of the 17 proteins that the Poole group was able to confirm, Glimmer3 predicted 16, of which 14 also agreed on the start sites.
2.6 Start-site prediction accuracy
Assessing the accuracy of start-site predictions is very difficult due to the scarcity of reliable data about start sites. For E.coli, a substantial number of proteins have been verified through N-terminal sequencing, providing a highly accurate (although limited to just one species) set of data for measuring the accuracy of start-site predictions. The curators of the EcoGene database (Rudd, 2000) have collected and annotated 878 genes (as of July 2006) from E.coli with confirmed start sites, and we used these to measure Glimmer3's accuracy. For comparison purposes, we also tested Glimmer3 on three of the same data sets used by Zhu et al. (2004) to assess the accuracy of the program MED-Start. All these data sets are described in the table below.
Comparison of start-site prediction accuracy
The table shows Glimmer3 prediction results on these four data sets. On the latest EcoGene data set, Glimmer3 predicted all but four genes and matched the correct start site on 816 genes (92.9%). For the latter three data sets, we also show two sets of MED predictions: one from (Zhu et al., 2004), obtained by applying MED-Start to Glimmer2 ORFs, and the other from the MED web site (http://ctb.pku.edu.cn/main/SheGroup/MED2.htm) using MED 2.0, which incorporates MED-Start within it. The results show Glimmer3 to be slightly more accurate on the E.coli data sets; on the Bacillus subtilis data set it matches MED-Start while being slightly less accurate than MED 2.0.
2.7 Separating sequences from different genomes
Although we designed Glimmer's IMM to model the 3-periodic structure of protein coding sequences, it also can be employed for more general sequence modeling. P.didemni is a photosynthetic microbe that lives as an endosymbiont in its host organism, the sea squirt L.patella. Because P.didemni can only be cultured in L. patella cells, it was not surprising that in the whole-genome shotgun sequencing project for P.didemni, a large number of sequences were from L.patella. Such a mixture of reads from two genomes, at different coverage densities, causes problems for genome assembly software, which typically assumes that its input derives from a single genome that was sampled at a uniform rate. Because no reference sequences were available (which would have allowed us to separate the sequences based on homology), we used the Glimmer IMM to classify the two types of sequences.
We began with an initial assembly of all 82 337 shotgun reads. Because the genome size of the bacterium (approximately 5 million base pairs) is much smaller than that of its eukaryote host (over 160 Mbp), the depth of coverage of the bacterium was much greater. Consequently any large scaffolds in the assembly would, with near certainty, consist of P.didemni sequences. Conversely, reads that failed to align with any other reads (singletons) would disproportionately be from the larger genome. Accordingly, we created training sets by classifying reads from assembly scaffolds at least 10Kbp long as being from P.didemni (36 920 reads), and reads where both the read and its clone-insert mate were singletons as being from L.patella (21 276 reads). This left 24 141 reads unclassified.
We created non-periodic Glimmer IMMs from these two training sets and classified sequences based on which of the two models gave a higher score. In a 5-way cross-validation test using the initial classification sets, the models achieved 98.9% accuracy on P.didemni reads and 99.9% accuracy on L.patella reads. The models classified 22% and 78% of the 24 141 unclassified reads as being from P.didemni and L.patella, respectively. One way to measure prediction accuracy on this test set is by considering the predictions of mate-pair reads. Because each pair comes from a single DNA template, the classification of both reads in the pair should be the same. There were 10 500 mate pairs in our unclassified set, of which only 207 (2%) were inconsistently classified. Since each inconsistent pair has one correct and one incorrect classification, this indicates 99% accuracy.
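The classification step itself is simple: score each read under both models and assign it to the higher-scoring one. In the sketch below each "model" is just a base-composition log-likelihood with invented frequencies; the real classifier uses two trained non-periodic Glimmer IMMs:

```python
import math

# Toy sketch of two-model read classification.  A real Glimmer IMM
# conditions each base on a context window; here each model is a
# simple independent base-composition score with invented frequencies.

def make_composition_model(freqs):
    def score(read):
        return sum(math.log(freqs[b]) for b in read)
    return score

p_didemni = make_composition_model({'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3})
l_patella = make_composition_model({'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2})

def classify(read):
    # Assign the read to whichever model gives the higher log-likelihood.
    return 'P.didemni' if p_didemni(read) > l_patella(read) else 'L.patella'

print(classify("ATATTAAT"))  # AT-rich read -> 'P.didemni' under these toy models
print(classify("GCGGCCGC"))  # GC-rich read -> 'L.patella'
```

The mate-pair check then follows directly: if both reads of a pair receive the same label, the pair is consistent; our 2% inconsistency rate implies roughly 99% per-read accuracy, since each inconsistent pair contains exactly one misclassified read.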
Besides the obvious benefit of removing the host sea squirt sequence from the assembly result, separating the two types of reads improved the quality of the shotgun assembly. The assembly of P.didemni using the mix of all reads produced 65 scaffolds 20 Kbp or longer, with a total length of 5.74 Mbp. The assembly using just the reads classified as P.didemni yielded 58 scaffolds 20 Kbp or longer, with a total length of 5.84 Mbp. Both assemblies were run with the same parameter settings using the Celera Assembler program (Myers et al., 2000).