|Home | About | Journals | Submit | Contact Us | Français|
The Glimmer gene-finding software has been successfully used for finding genes in bacteria, archæa and viruses representing hundreds of species. We describe several major changes to the Glimmer system, including improved methods for identifying both coding regions and start codons. We also describe a new module of Glimmer that can distinguish host and endosymbiont DNA. This module was developed in response to the discovery that eukaryotic genome sequencing projects sometimes inadvertently capture the DNA of intracellular bacteria living in the host.
The new methods dramatically reduce the rate of false-positive predictions, while maintaining Glimmer's 99% sensitivity rate at detecting genes in most species, and they find substantially more correct start sites, as measured by comparisons to known and well-curated genes. We show that our interpolated Markov model (IMM) DNA discriminator correctly separated 99% of the sequences in a recent genome project that produced a mixture of sequences from the bacterium Prochloron didemni and its sea squirt host, Lissoclinum patella.
The genomes of bacteria, archæa and viruses are very gene-dense, with protein-coding regions typically comprising 90% or more of the DNA sequence. As a consequence, the accuracy of prokaryotic gene-finding programs depends primarily on identifying which of the six possible reading frames contains the true gene (Besemer and Borodovsky, 1999; Borodovsky and McIninch, 1993; Guo et al., 2003; Ouyang et al., 2004). The accuracy of gene finding systems in these species is very high as compared to eukaryotic gene finders; previous versions of the Glimmer system had a sensitivity of 99% or higher (Delcher et al., 1999; Salzberg et al., 1998).
However, there is still some room for improvement. First, the measurement of sensitivity relies on comparisons to well-annotated bacterial genomes, where the best we can do is to count how many ‘known’ genes are found by a gene finder. Genes are considered known if they have clear homology, as measured by amino-acid similarity, to genes in other species. This similarity often breaks down near the 5′ end of the transcript, which also tends to be the region where gene finders disagree on the precise position of the start codon. Thus, one area where gene finders might still improve is in prediction of start sites, as has been pointed out in previous studies (Besemer et al., 2001).
A second issue is false positives, i.e. gene predictions that do not correspond to genuine protein-coding genes. Because bacteria are so gene-dense, it is very difficult to say with confidence that any gene predicted to lie in an otherwise intergenic region is false. Fortunately, the growing number of sequenced genomes from closely related species does provide some help with this question, and indeed some microbial gene-finding systems rely on database searches to identify genes (Badger and Olsen, 1999; Frishman et al., 1998; Larsen and Krogh, 2003; Nielsen and Krogh, 2005). If a predicted protein is not conserved between closely related species, then evolutionary arguments can be made that the prediction is false. A greater source of false positives in earlier releases of Glimmer, though, came from predicting too many overlapping genes. Because truly overlapping genes are quite rare in bacterial genomes, the system should generally avoid such predictions. Here too, homology to other species can resolve the question of which gene is correct. Our challenge was to reduce the false positive rate of Glimmer without sacrificing its high sensitivity (true positive) rate.
The new Glimmer, release 3.0, achieves a dramatically lower false-positive rate, predicts many more start sites correctly, and maintains its high true positive rate. It does this through a new algorithm for scanning coding regions, a new start site detection module, and an overall architecture that for the first time integrates all gene predictions across an entire genome. In addition, a new automated training program produces substantially improved training sets, particularly on genomes with high GC-content.
We also introduce a new use for the interpolated Markov model (IMM) that is at the core of Glimmer. Recent large-scale sequencing projects of eukaryotic species have inadvertently captured the genomes of bacterial endosymbionts as a side effect of the overall project (Salzberg et al., 2005). When a eukaryotic species has an intracellular endosymbiont, as is true for many invertebrates including fruit flies, mosquitoes and nematodes, then a whole-genome shotgun sequencing project cannot avoid capturing some of the symbiont DNA. Consequently there is a need to identify and separate the DNA from the host and the symbiont, in order to assemble the two genomes separately (and correctly). Besides these eukaryotic genome projects, a growing number of bacterial sequencing projects are targeting endosymbionts that can only be grown inside their hosts, including Wolbachia pipientis (Wu et al., 2004) and Prochloron didemni (J. Ravel, personal communication). In these projects, despite investigators' best efforts to isolate bacterial DNA, a considerable amount of eukaryotic host DNA remained in the sample and needed to be removed. Motivated by these problems, we developed a new algorithm in which the IMM within Glimmer is trained separately on host and endosymbiont DNA, and then turned into a classifier to separate the raw sequences. We report here on this new module and its successful use in two recent genome projects.
The Glimmer 3.0 package is distributed as OSI Certified Open Source software and is freely available at http://cbcb.umd.edu/software/glimmer
The IMM scoring algorithm in Glimmer computes the log-likelihood that a given interval on a DNA sequence was generated by a model of coding versus noncoding DNA. This model represents the probability of a nucleotide given a subset of positions in a window (called the context) adjacent to the nucleotide—for details, see (Delcher et al., 1999; Salzberg et al., 1998). Glimmer 3.0 takes advantage of the flexibility of this algorithm by scoring all open reading frames (ORFs) in reverse, from the stop codon back toward the start codon, with the probability of each base conditioned on a context window on its 3′ side and the score of the ORF being the log-likelihood sum of the bases contained in the ORF. The score is computed incrementally as a cumulative sum at every codon position in a given ORF. In many cases, these scores show a marked peak in value, and furthermore this peak typically occurs near the correct start site (see Fig. 1 for an example). The advantage of scanning ORFs in reverse is that for nucleotides near the start site, the context window of the IMM is contained within the coding portion of the gene, which is the type of data on which it was trained. This results in a more precise cumulative score at nucleotides very close to the start site, compared to a context window on the opposite side which would intersect a non-coding region.
As the figure shows, the cumulative IMM score steadily increases as we move away from the stop codon at the left until it reaches a peak and then begins to decline. The decrease occurs because the bases upstream of the gene start codon are non-coding and produce negative IMM log-odds scores. The figure also shows, as vertical dashed lines, the positions of all possible start codons in the ORF. We hypothesized that if we used the highest-scoring start codon in these plots, then Glimmer would find a higher percentage of true start sites. This is borne out in our experiments, described subsequently.
An important difference between this algorithm and earlier versions of Glimmer is that, unlike those versions, which had a strong bias in favor of longer ORFs, this algorithm chooses start sites based on their relative scores.
In previous versions of Glimmer, the ribosome binding site (RBS) was essentially ignored, even though it provides a strong signal for the position of the true start site. We addressed this problem with a standalone program, RBSfinder, that can be run as a post-processor on the results of Glimmer's analysis. RBSfinder is quite effective at finding ribosome binding sites and adjusting Glimmer's predictions (Suzek et al., 2001), but we nonetheless felt that a better design would integrate RBS evidence directly into the gene-finding algorithm. Glimmer3 now contains this long-awaited integration.
After experimenting with several alternative algorithms, we found that the ELPH software (http://cbcb.umd.edu/software/ELPH) was highly effective at identifying the likely RBS in most bacterial genomes. Input to ELPH is a specified motif length and any set of sequences, in which it identifies likely shared motifs using a Gibbs sampling algorithm. ELPH produces a position weight matrix (PWM) that Glimmer3 then uses to score any potential RBS. If a substantial set of training genes is available, the regions upstream from their starts can be given to ELPH to produce a PWM. Otherwise, Glimmer3 can bootstrap itself by first running without a PWM, generating a set of gene predictions, and then extracting regions upstream of those predictions as input to ELPH. Glimmer3 can then be re-run with the PWM to produce a more accurate set of start-site predictions. The entire process can be iterated as desired until a consistent PWM and set of gene predictions result. This strategy of using a Gibbs-sampler to find RBS motifs in an iterative fashion was introduced in the GeneMarkS gene-finding system (Besemer et al., 2001).
Both Glimmer2 and Glimmer3 start by identifying open reading frames (ORFs) with sufficiently high IMM scores to be processed further. In many cases, these ORFs overlap by more than the (user-specified) maximum allowed distance, indicating that only one of them is a true gene. Glimmer2 uses a series of rules based on ORF lengths, ORF scores and the IMM score of the overlapping region to attempt to resolve these overlap cases. When the rules do not produce a clear conclusion, however, Glimmer2 outputs both ORFs with an annotation indicating the overlap. As a result, Glimmer2 can have a high false-positive rate, particularly for high-GC genomes, which have large numbers of overlapping ORFs.
In contrast, Glimmer3 begins by assigning a score to each valid start position within an ORF. This score is the sum of RBS score plus the IMM coding potential score plus a score for the start codon (determined by the relative frequency of each possible start codon in the same training set used to determine the RBS). Each possible start codon is linked to the stop codon that terminates its ORF.
A global dynamic-programming algorithm is then used to select the set of ORFs and start sites with maximum total score across the entire input sequence, subject to the constraint that no overlaps greater than a specified maximum are allowed. Specifically, the set of potential start sites and stop positions is scanned in sorted order by location on the input sequence. At each start or stop feature f, the score of the maximum-scoring set of genes up to and including f is computed as the maximum compatible prior score in any of the six reading frames plus the score of f. Because overlaps are allowed, the value for a feature f may, in fact, change as the result of a feature encountered after f. To accommodate this case, our algorithm backtracks to update scores within the maximum allowed overlap distance and adjusts the scores to avoid double counting the score of the overlap region. Because the maximum overlap distance is typically small compared to the average gene length, the additional cost is usually insignificant.
In many respects this algorithm functions like the hidden Markov models (HMMs) used in other gene-finding programs such as GeneMark.hmm (Lukashin and Borodovsky, 1998) and EasyGene (Larsen and Krogh, 2003). The principal differences are that small overlaps between genes are allowed (without resort to a complicated set of overlap states in an HMM) and that potential coding regions are prescored by the IMM in the stop-to-start direction so that the scanning direction of the algorithm effectively alternates on different segments of the sequence. The result is that the final set of Glimmer3 predictions contains no overlaps greater than the specified maximum, and the total number of Glimmer3 predictions is almost always less than the corresponding number of Glimmer2 predictions.
One of Glimmer's strengths has always been the ease with which any user can automatically train it on a new genome. The long-orfs program in the Glimmer system is used to create a training set of genes from a genome by selecting ORFs above a threshold length that do not overlap other ORFs above that threshold length. The threshold length is computed by the program to be the value that maximizes the number of non-overlapping ORFs produced, thus maximizing the amount of data in the training set. For most genomes this approach is quite effective, typically producing a training set containing nearly half of all genes with relatively few ORFs that are not genes. In the case of high-GC genomes (>60% GC), however, the scarcity of stop codons results in an abundance of long ORFs that are not genes. For such genomes, the version of long-orfs in Glimmer2 produces very small output sets, with many incorrect genes.
To overcome this problem, the long-orfs program in Glimmer3 incorporates a new routine to filter the initial set of ORFs based on amino-acid composition. Here we wish to take advantage of the fact that the genes in widely disparate bacterial genomes tend to use a common, universal amino acid distribution (Luscombe et al., 2001; Pascal et al., 2005). By comparing the ORFs found by long-orfs to a universal distribution, we should be able to eliminate many ORFs that are highly unlikely to be protein-coding genes. Specifically, we compute the distribution of each ORF's amino acids and compare it to both a positive model derived from a large sample of microbial genomes, and to a negative model created from alternative reading frames of those genes. We then compute the ratio of the distance of the candidate ORF's distribution to the positive and negative models, and only those ORFs whose ratio is below a user-specified threshold are passed on to the length- and overlap-calculation stage. These distance calculations are actually computed using the entropies of amino-acid distributions as described in (Ouyang et al., 2004).
Table 1 compares the output of the Glimmer2 and Glimmer3 versions of long-orfs to the set of all annotated genes for a sample of 13 bacterial and archæal genomes obtained from NCBI (Wheeler et al., 2006). Note how for the high-GC (67%) Ralstonia solanacearum genome, Glimmer2's long-orfs outputs only 288 ORFs, of which a mere 55% match annotated genes. In contrast, Glimmer3's long-orfs identifies 1175 ORFs, 96% of which match annotated genes.
In order to test the effect of the improvements in Glimmer3, we compared it to several different sets of data, shown in the tables. First, we compared its predictions on a sample of complete genomes to the ‘known’ genes from those genomes. We used NCBI annotation to determine when a gene was known, by simply removing all genes annotated as ‘hypothetical’ from the set of known genes. Genes assigned a function are in most cases closely homologous to genes from other genomes, which provides independent evolutionary evidence that these genes are real. We are aware that this method has its shortcomings, but no other method yields nearly as many genes for testing. This method also does not guarantee that the start codon is correctly predicted, because homology does not need to extend for the full length of a predicted protein in order to be considered adequate evidence for assigning function.
Another important test of Glimmer 3.0 was its comparison in relation to Glimmer 2.13, and we, therefore, ran both algorithms on the same set of genomes. In Table 2 we show the accuracy results of both Glimmer2 and Glimmer3 on our benchmark genome sample. Both algorithms were trained and tested on the same data set of non-hypothetical genes in an eight-way cross-validation experiment. (For each genome, the genes were divided into eight approximately equal-size subsets. One subset was held out for testing and Glimmer was trained on the remaining subsets. This was repeated eight times so that each gene was part of one test set.) As shown in the table, Glimmer3 nearly always achieves equal or higher sensitivity than Glimmer2, but with far fewer additional predictions, indicating much greater specificity. Glimmer3 also has far greater agreement between its start codon predictions and those in the benchmark genomes: in 11 of the 13 genomes, it has greater agreement than Glimmer2. For example, in Bacillus anthracis Glimmer3 predicts 726 start sites that agree with the NCBI annotation but disagree with Glimmer2. Note too that we cannot guarantee that the start sites are correctly annotated for any of these genomes, but without further laboratory evidence these are the best data available. Table 3 compares the results of running both Glimmer2 and Glimmer3 using the output of their respective versions of the long-orfs program to produce a training set. As before, the predictions are compared to non-hypothetical genes in the annotation. In this case, Glimmer2's performance is substantially worse because of errors in its long-orfs output. Glimmer3's performance, however, is substantially the same as when it is trained with annotated data, indicating that it is likely to do well even in the absence of a pre-computed set of ‘known’ genes for training.
Table 4 compares Glimmer3 predictions to those of three other gene-finders, GeneMark.hmm, EasyGene (Larsen and Krogh, 2003) and GeneMarkS (Besemer et al., 2001), and shows that Glimmer obtains comparable results. These other systems all run through web servers or by downloading precomputed predictions, and EasyGene uses homology-search results to help determine its parameters. In contrast, Glimmer runs locally and offers many options for choosing parameters and training sets, and can be run on collections of contigs from unfinished assemblies.
Further evidence of Glimmer3's improved accuracy is provided by comparing its predictions to the results of recent laboratory experiments to identify unannotated genes. One such experiment was conducted on the hyperthermophilic archæon Pyrococcus furiosos by Poole et al. (Poole et al., 2005), who used microarray expression evidence and recombinant protein tests to examine 127 ORFs not in the NCBI annotation. Of the 17 proteins that the Poole group were able to confirm, Glimmer3 predicted 16, of which 14 also agreed on the start sites.
Assessing the accuracy of start-site predictions is very difficult due to the scarcity of reliable data about start sites. For E.coli, a substantial number of proteins have been verified through N-terminal sequencing, providing a highly accurate (although limited to just one species) set of data for measuring the accuracy of start site predictions. The curators of the EcoGene database (Rudd, 2000) have collected and annotated 878 genes (as of July 2006) from E.coli with confirmed start sites, and we used these to measure Glimmer3's accuracy. For comparison purposes, we also tested Glimmer3 on three of the same data sets used in (Zhu et al., 2004) to assess the accuracy of the program MED-Start. All these data sets are described in Table 5.
The table shows Glimmer3 prediction results on these four data sets. On the latest EcoGene dataset we found that Glimmer3 predicted all but four genes, and matched the correct start site on 816 genes (92.9%). For the latter three data sets, we also show two sets of MED predictions: one from (Zhu et al., 2004) obtained by applying MED-Start to Glimmer2 orfs and the other from the MED web site (http://ctb.pku.edu.cn/main/SheGroup/MED2.htm) using MED 2.0, which incorporates MED-Start within it. The results show Glimmer3 to be slightly more accurate on the E.coli data sets, while on the Bacillus subtilis data set the same as MED-Start while slightly less accurate than MED 2.0.
Although we designed Glimmer's IMM to model the 3-periodic structure of protein coding sequences, it also can be employed for more general sequence modeling. P.didemni is a photosynthetic microbe that lives as an endosymbiont in its host organism, the sea squirt L.patella. Because P.didemni can only be cultured in L. patella cells, it was not surprising that in the whole-genome shotgun sequencing project for P.didemni, a large number of sequences were from L.patella. Such a mixture of reads from two genomes, at different coverage densities, causes problems for genome assembly software, which typically assumes that its input derives from a single genome that was sampled at a uniform rate. Because no reference sequences were available (which would have allowed us to separate the sequences based on homology), we used the Glimmer IMM to classify the two types of sequences.
We began with an initial assembly of all 82 337 shotgun reads. Because the genome size of the bacterium (approximately 5 million base pairs) is much smaller than that of its eukaryote host (over 160 Mbp), the depth of coverage of the bacterium was much greater. Consequently any large scaffolds in the assembly would, with near certainty, consist of P.didemni sequences. Conversely, reads that failed to align with any other reads (singletons) would disproportionately be from the larger genome. Accordingly, we created training sets by classifying reads from assembly scaffolds at least 10Kbp long as being from P.didemni (36 920 reads), and reads where both the read and its clone-insert mate were singletons as being from L.patella (21 276 reads). This left 24 141 reads unclassified.
We created non-periodic Glimmer IMMs from these two training sets and classified sequences based on which of the two models gave a higher score. In a 5-way cross-validation test using the initial classification sets, the models achieved 98.9% accuracy on P.didemni reads and 99.9% accuracy on L.patella reads. The models classified 22% and 78% of the 24 141 unclassified reads as being from P.didemni and L.patella, respectively. One way to measure prediction accuracy on this test set is by considering the predictions of mate-pair reads. Because each pair comes from a single DNA template, the classification of both reads in the pair should be the same. There were 10 500 mate pairs in our unclassified set, of which only 207 (2%) were inconsistently classified. Since each inconsistent pair has one correct and one incorrect classification, this indicates 99% accuracy.
Besides the obvious benefit of removing the host sea squirt sequence from the assembly result, separating the two types of reads produced improvements in the quality of the shotgun assembly. The assembly of P.didemni using the mix of all reads produced 65 scaffolds 20 Kbp or longer, with total length of 5.74 Mbp. The assembly using just reads classified as P.didemni yielded 58 scaffolds 20 Kbp or longer, with total length of 5.84 Mbp. Both assemblies were run with the same parameter settings using the Celera Assembler program (Myers et al., 2000).
The latest release of the Glimmer gene-finding system is significantly improved compared to its predecessor, most notably with respect to specificity and accuracy in predicting translation initiation sites. A major difficulty in developing software that can make highly accurate predictions of coding starts is the lack of experimentally confirmed training and testing data. Despite this limitation, Glimmer3 start-site predictions have achieved a remarkably high success rate on the best-available dataset, the Ecogene verified genes in E.coli.
A notable advantage of the Glimmer3 system is that it is completely self-contained. This permits users to choose either the long-orfs program, or any set of genes that may wish to use, as training sets to build the IMM and find the RBS motif. This permits the system to be used on relatively short sequence fragments, like low-coverage draft genome assembly sequences. If necessary, a user can even use a closely related organism as a source of training data.
Another advantage of Glimmer3 is that it is distributed as source code that can run on any system with a C++ compiler. This enables any part of the program to be modified by the user and allows various modules to be used for purposes other than gene-finding, as we have demonstrated in separating target and host-contaminant data for the P.didemni shotgun sequencing project.
Thanks to Daniel H. Haft and William C. Nelson at TIGR for suggesting improvements and helping to test Glimmer, to Michaela Pertea for assistance with the ELPH program, and to Jacques Ravel for providing the P. didemni sequences under NSF grant EF-0412226. This work was supported in part by NIH grants R01-LM006845 and R01-LM007938 and HSARPA award W81XWH-05-2-0051 to SLS, and by NIAD contract. NIH-NIAID-DMID-04-34, HHSN266200400038C. Funding to pay the Open Access publication charges was provided by NIH grant R01-LM007938.
Associate Editor: Alfonso Valencia
Availability:Glimmer is OSI Certified Open Source and available at http://cbcb.umd.edu/software/glimmer
Conflict of Interest: none declared.