A published benchmark independent of our lab (Freyhult et al.) and our own internal benchmark used during development (Nawrocki and Eddy, 2007) both find that infernal and other CM-based methods are the most sensitive and specific tools for structural RNA homology search among those tested. Figure 1 shows updated results of our internal benchmark, comparing infernal 1.0 with the previous version (0.72), which was benchmarked in Freyhult et al., and with family-pairwise search using BLASTN (Altschul et al.; Grundy, 1998). infernal's sensitivity and specificity have greatly improved, mainly due to three improvements in the implementation (Eddy, 2009): a biased composition correction to the raw log-odds scores; the use of Inside log-likelihood scores (the summed score of all possible alignments of the target sequence) in place of CYK scores (the single maximum likelihood alignment score); and the introduction of approximate E-value estimates for the scores.
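The distinction between CYK and Inside scores can be illustrated with a toy calculation. The sketch below assumes that the log-odds scores (in bits) of every possible alignment are already available as a list, which is purely illustrative: a real CM implementation computes both quantities by dynamic programming without enumerating alignments. The function names are hypothetical and are not part of infernal.

```python
import math

def cyk_like_score(alignment_bits):
    """Score of the single best alignment: the maximum over all alignments."""
    return max(alignment_bits)

def inside_like_score(alignment_bits):
    """Summed score over all alignments: log2 of the sum of 2**score.

    Subtracting the maximum first keeps the exponentials in a safe range
    (the usual log-sum-exp trick, here in base 2).
    """
    top = max(alignment_bits)
    return top + math.log2(sum(2.0 ** (s - top) for s in alignment_bits))

# Toy example (made-up scores): one good alignment plus near-optimal alternatives.
scores = [10.0, 9.0, 9.0, 4.0]
print("CYK-like score:   ", cyk_like_score(scores))
print("Inside-like score:", inside_like_score(scores))
```

Because the sum includes the best alignment, the Inside-like score is never lower than the CYK-like score, and it gains the most when many alternative alignments score close to the optimum.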
Fig. 1. ROC curves for the benchmark. Plots are shown for the new infernal 1.0 with and without filters, for the old infernal 0.72 and for family-pairwise searches (FPS) with BLASTN. CPU times are total times for all 51 family searches measured for single execution ...
The benchmark dataset used in Figure 1 includes query alignments and test sequences from 51 Rfam (release 7) families [details in Nawrocki and Eddy (2007)]. No query sequence is >60% identical to a test sequence. The 450 total test sequences were embedded at random positions in a 10 Mb ‘pseudogenome’. Previously, we generated the pseudogenome sequence from a uniform residue frequency distribution (Nawrocki and Eddy, 2007). Because base composition biases in the target sequence database cause the most serious problems in separating significant CM hits from noise, we improved the realism of the benchmark by generating the pseudogenome sequence from a 15-state fully connected HMM trained by Baum–Welch expectation maximization (Durbin et al.) on genome sequence data from a wide variety of species. Each of the 51 query alignments was used to build a CM and search the pseudogenome. A single list of all hits for all families was collected and ranked, and true and false hits were defined as described in Nawrocki and Eddy (2007), producing the ROC curves in Figure 1.
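The pseudogenome construction can be sketched as follows. This is a minimal illustration, not the benchmark code: it uses a two-state HMM with made-up transition and emission probabilities in place of the 15-state model trained by Baum–Welch, a 1 kb background in place of 10 Mb, and two invented test sequences. All function and variable names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def sample_hmm_sequence(length, trans, emit, rng):
    """Sample a DNA string from a fully connected HMM.

    trans[i][j] is P(next state j | current state i);
    emit[i] holds emission probabilities over ALPHABET for state i.
    """
    state = rng.randrange(len(trans))  # assumption: uniform initial state
    residues = []
    for _ in range(length):
        residues.append(rng.choices(ALPHABET, weights=emit[state])[0])
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
    return "".join(residues)

def embed_at_random_positions(background, test_seqs, rng):
    """Insert each test sequence at a random background position.

    Returns the combined sequence and the (start, end) coordinates of each
    embedded test sequence, which are needed later to label hits as true
    or false.
    """
    positions = sorted(rng.randrange(len(background) + 1) for _ in test_seqs)
    pieces, coords = [], []
    prev = 0      # background position already copied up to
    out_len = 0   # length of the output built so far
    for pos, seq in zip(positions, test_seqs):
        pieces.append(background[prev:pos])
        out_len += pos - prev
        coords.append((out_len, out_len + len(seq)))
        pieces.append(seq)
        out_len += len(seq)
        prev = pos
    pieces.append(background[prev:])
    return "".join(pieces), coords

# Hypothetical toy parameters: an AT-rich state and a GC-rich state.
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.35, 0.15, 0.15, 0.35], [0.15, 0.35, 0.35, 0.15]]

rng = random.Random(42)
background = sample_hmm_sequence(1000, TRANS, EMIT, rng)
test_seqs = ["GGGGAAAACCCC", "ACGTTGCA"]
pseudogenome, coords = embed_at_random_positions(background, test_seqs, rng)
print(len(pseudogenome), coords)
```

Switching between states with different emission probabilities produces regions of locally biased base composition, which a single uniform residue distribution cannot.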
CM searches require a large amount of compute time [our 10 Mb benchmark search takes about 30 h per model on average (Fig. 1)]. To alleviate this, infernal 1.0 implements two rounds of filtering. When appropriate, the HMM filtering technique described by Weinberg and Ruzzo (2006) is applied first, with filter thresholds configured by cmcalibrate [occasionally a model with little primary sequence conservation cannot be usefully accelerated by a primary sequence-based filter, as explained in Eddy (2009)]. The query-dependent banded (QDB) CYK maximum likelihood search algorithm is used as a second filter with relatively tight bands [β = 10^-7, where the β parameter is the subtree length probability mass excluded by imposing the bands, as explained in Nawrocki and Eddy (2007)]. Any sequence fragments that survive the filters are searched a final time with the Inside algorithm [again using QDB, but with looser bands (β = 10^-15)]. In our benchmark, the default filters accelerate similarity search by about 30-fold overall, while sacrificing a small amount of sensitivity (Fig. 1). This makes version 1.0 substantially faster than 0.72. BLAST is still orders of magnitude faster, but significantly less sensitive than infernal. Further acceleration remains a major goal of infernal development.
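The control flow of such a filter cascade can be sketched as below. The three scoring functions are toy stand-ins invented for this example (they are not the HMM, QDB CYK or Inside algorithms), and the cutoffs are arbitrary. The point is only that each stage is run on the fragments that survived the previous one, so the most expensive scorer sees a small fraction of the input.

```python
from collections import Counter

calls = Counter()  # how often each stage is invoked
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def _mirror_pairs(seq):
    """Toy structure signal: complementary residues at mirrored positions."""
    return sum(1 for a, b in zip(seq, reversed(seq)) if (a, b) in PAIRS)

def toy_hmm_score(seq):
    """Stand-in for the cheap sequence-only filter (here: G+C count)."""
    calls["hmm"] += 1
    return seq.count("G") + seq.count("C")

def toy_cyk_score(seq):
    """Stand-in for the second, structure-aware filter."""
    calls["cyk"] += 1
    return _mirror_pairs(seq)

def toy_inside_score(seq):
    """Stand-in for the final, most expensive scoring pass."""
    calls["inside"] += 1
    return _mirror_pairs(seq) + 0.5

def cascade_search(fragments, hmm_cutoff, cyk_cutoff, final_cutoff):
    """Score each fragment with progressively more expensive stages."""
    hits = []
    for frag in fragments:
        if toy_hmm_score(frag) < hmm_cutoff:
            continue  # rejected by the first filter
        if toy_cyk_score(frag) < cyk_cutoff:
            continue  # rejected by the second filter
        score = toy_inside_score(frag)
        if score >= final_cutoff:
            hits.append((frag, score))
    return hits

fragments = ["GGGGAAAACCCC", "ATATATATATAT", "GCGCGCTTTTTT", "AAAAAAAAAAAA"]
hits = cascade_search(fragments, hmm_cutoff=5, cyk_cutoff=6, final_cutoff=7)
print(hits)
print(dict(calls))
```

The second fragment in this toy input has strong mirrored complementarity but is rejected by the first stage, a small analogue of the sensitivity that filtering can cost.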
The computational cost of CM alignment with cmalign has been a limitation of previous versions of infernal. Version 1.0 now uses a constrained dynamic programming approach first developed by Brown (2000) that uses sequence-specific bands derived from a first-pass HMM alignment. This technique offers a dramatic speedup relative to unconstrained alignment, especially for large RNAs such as small and large subunit (SSU and LSU, respectively) ribosomal RNAs, which can now be aligned in roughly 1 and 3 s per sequence, respectively, as opposed to 12 min and 3 h in previous versions. This acceleration has facilitated the adoption of infernal by RDP, one of the main ribosomal RNA databases (Cole et al.).
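The effect of constraining a dynamic programming matrix with bands can be illustrated with a much simpler problem. The sketch below is not CM alignment: it computes edit distance between two strings and uses a fixed-width diagonal band as a stand-in for the sequence-specific, HMM-derived bands described above. It returns the number of matrix cells filled alongside the result, so that the saving from banding is visible.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells with |i - j| <= band.

    Returns (distance, cells_filled). Cells outside the band are treated as
    unreachable, so the result equals the true distance only if an optimal
    alignment lies inside the band; if len(a) and len(b) differ by more
    than the band width, the distance is infinity.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    cells = 0
    for i in range(n + 1):
        lo = max(0, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            cells += 1
            if i == 0 and j == 0:
                dp[i][j] = 0
                continue
            best = inf
            if i > 0:
                best = min(best, dp[i - 1][j] + 1)        # deletion
            if j > 0:
                best = min(best, dp[i][j - 1] + 1)        # insertion
            if i > 0 and j > 0:
                mismatch = a[i - 1] != b[j - 1]
                best = min(best, dp[i - 1][j - 1] + mismatch)  # (mis)match
            dp[i][j] = best
    return dp[n][m], cells

narrow = banded_edit_distance("GATTACA", "GATCACA", band=1)
full = banded_edit_distance("GATTACA", "GATCACA", band=7)
print("banded:", narrow)
print("unbanded:", full)
```

Here both runs find the same distance, but the banded run fills far fewer cells. For a fixed band width the banded cost grows roughly linearly with sequence length while the unbanded cost grows quadratically, which is why long sequences benefit most from this kind of constraint.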
infernal is now a faster and more sensitive tool for RNA sequence analysis. Version 1.0's heuristic acceleration techniques make some important applications possible on a single desktop computer in less than an hour, such as searching a prokaryotic genome for a particular RNA family, or aligning a few thousand SSU rRNA sequences. Nonetheless, infernal remains computationally expensive, and many problems of interest require the use of a cluster. The most expensive programs (cmcalibrate, cmsearch and cmalign) are implemented in coarse-grained parallel MPI versions which divide the workload into independent units, each of which is run on a separate processor.