Our aim has been to enable high-throughput searches for rRNA while producing accurate and consistent predictions suitable for comparative analyses. For this purpose, we have developed the RNAmmer package which relies on HMMs for both speed and accuracy. HMMs were made using HMMer (15), which from a multiple alignment produces an HMM where match states represent columns with a specific nucleotide distribution, corresponding deletion states represent the possibility of gaps, and insertion states represent columns with large numbers of gaps; transition probabilities between the states indicate how likely each of the states are. HMMs thus differ from sequence alignments in that the likelihood of insertions and deletions may vary along the sequence. When searching a sequence with an HMM, the score indicates how well the sequence segment matches the model. The information content of a position, which reflects the nucleotide distribution and the likelihood of gaps, indicates how well that position is conserved. A good match to the HMM may come either from a highly conserved region which may well be short, or from a longer region with only weak conservation. We find both these cases. Bacterial 16S are detected despite almost half of the nucleotides being assigned to insert states, as other regions are highly conserved. For archaeal 23S, however, the information content of each position is low, but the sequence is long and there are few allowed insert states.
These aspects can also explain cases of poor performance, both of the full model and of the spotter model. The low information content in the eukaryotic 5S and 18S alignments indicates that these sequences are more divergent than archaeal and bacterial 5S and 16S. In addition, 40% of the 5S and 75% of the 18S alignment give rise to insert states in the HMM. Thus, there is little for the HMM to recognize. In addition, many of the missed 18S rRNAs were from Cryptophyta, a phylum which makes up only 0.6% of the alignment data.
The archaeal 5S show the same characteristics as the eukaryotic 5S and 18S, which most likely explains the low performance for these rRNAs. The score for archaeal 5S hits were generally low, and the spotter score comes only from a 75 nt part of the sequence giving it even lower score causing it to miss 12 of the full model hits. It is notable, however, that these were the only cases missed by the spotter model: with the exception of archaeal 5S, our analyses show that the spotter should be able to detect rRNAs unless they are much further diverged than what we find in our data.
Columns at the beginning and end of the multiple alignments often have low conservation and many gaps. Such columns are generally accommodated into the HMM as insert states, but HMMer ignores them at the beginning and end of the alignment. An example is the 5S, where match states stop around 10 columns from the end of the alignments effectively causing the HMM to predict the last conserved nucleotide of the consensus sequence rather than the stop of the rRNAs. Hence, it is not uncommon for the stop position of the 5S to be predicted up to 10 nt downstream of the annotated stop position.
These effects can also explain the endpoint accuracy that was seen when we compared our results to experimentally determined 16S sequences. We tried to find sequences where the ends had been experimentally verified by RACE or PCR, but such rRNAs proved difficult to find. All the ones we selected were sequenced, but it is uncertain to what extent the authors had tried to determine the ends. These experimentally found rRNAs did show better agreement with annotation than predictions in general, although this is not sufficient to conclude that our predictions are more accurate. Our stop predictions were very accurate, but more deviation was seen in the start predictions. These results could reflect more variation in the beginning of the alignments, which as in the 5S case could effectively cause the HMM to predict the last conserved nucleotide of the consensus sequence rather than the end of the rRNAs.
In some cases, larger endpoint deviations occur. This can happen when one of the ends of the model finds a better match in a different part of the sequence. Insertion states sometimes allows the HMM to insert long gap regions and thus find a matching stop position far from the rest of the sequence. As shown for the bacterial 16S sequences that displayed this phenomenon, this is less of a problem when the spotter model is employed. The window searched around the spotter hit would most likely be too short to accommodate such an insert, and the model would match with the proper sequence.
For fragmented rRNAs, long gap regions may be correctly predicted. This was seen for Coxiella burnetii 23S where our prediction has the same start position as annotated, but where the predicted stop position is 1884 nt downstream of GenBank's stop position. However, according to Entrez Gene, this rRNA appears in four pieces and with the same stop position as ours, suggesting that in some cases ‘too long’ predictions might actually be correct. These cases should normally not be masked when using the spotter unless inserts between the fragments would make it exceed the window size.
The HMM produced by HMMer requires time of order O(NM) to search a sequence of length N using a model with M states, M being proportional to the length of the multiple alignment. However, the speed is increased by using a 75 nt long spotter model to pre-screen the sequence, which requires time of order O(N), and then running the full HMM on windows around each spotter hit which requires time of order O(KM2) for K spotter hits, and window size proportional to M. The benefit of using the spotter is clearly illustrated in the M. capricolum searches. However, the time difference between the S. usitatus and the Sargasso Sea data searches shows that the spotter might lose its mission when dealing with many shorter sequences.
There are other approaches to predicting non-coding RNA. One commonly used method is sequence alignment, e.g. BLAST (3), Paralign (20) or FASTA (21). Another is based on structure-sensitive Stochastic Context Free Grammars (SCFG) (22) which form the basis of the tRNA prediction program tRNAscan-SE (23) and of Infernal (24), which is used when creating RFAM. While the sequence alignment methods are very fast, they are not particularly suited for prediction of non-coding RNA (1). Infernal, however, has a general worst case running time of order O(MN3) , which is prohibitive. The RFAM database (17,18), which includes 5S and the 5′ domain of 16S, uses BLAST to pre-screen genome sequences, followed by Infernal; despite a more efficient approach than the general SCFG, it does not analyze the entire 16S. A search for 5S in a 1 Mbp genome using Infernal took 4 hours 45 minutes: almost 1000 times as much as the 16 seconds used by RNAmmer for the much larger 16S model. A time-saving approach to SCFGs could be to use the RaveNna (25) package which can convert an RFAM SCFG to an HMM. This drastically reduces the running time; however, its usefulness would be limited since no models for the larger rRNAs are available. Another factor is that the 5S found by RaveNna (26) which were not already in RFAM were all in organellar sequences, sequences not analyzed by RNAmmer. For further comparisons and comments on these different methods, we refer to (1).
The RNAmmer program is available as a traditional HTML-based prediction server at http://www.cbs.dtu.dk/services/RNAmmer as well as through a SOAP-based web service. It is also available for download through the same site.