Non-coding RNAs have drawn much attention in the last couple of years, after being neglected for a long time [1]. They are now known to play key roles in diverse cellular processes such as regulation of gene expression, splicing and directing chemical modifications [2,3]. Functional categorization of RNAs is not yet complete as new functions are discovered continuously [4,5].
Detection of non-coding RNA (ncRNA) genes in genomic sequences is an urgent but unsolved problem in bioinformatics [6]. The accelerated pace of sequencing technology further increases the need for reliable identification of ncRNAs [7]. The main approaches to computational prediction of ncRNAs are compositional analysis, secondary structure prediction, structural or sequence-based homology and the use of promoter and terminator signals. Numerous tools following one of these approaches or combinations thereof exist [6,8].
Compositional analysis can be a simple scan for local GC-content, an approach successful in AT-rich hyperthermophiles [9]. Considering more compositional features in a machine learning approach has also shown success [10]. Because functional RNAs rely on a defined secondary structure, prediction of transcript minimum free energy is used to detect ncRNA genes [11]. Freyhult examined different quantities that can be used for this approach [12]. Sequence-based homology can be used for detection if reference genomes with appropriate evolutionary distances are available [13].
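The simplest form of compositional analysis mentioned above, a scan for local GC-content, can be sketched as follows. This is a minimal illustration and not the implementation used by any of the cited tools; the function names, window size, step and threshold are assumptions chosen for the example.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(genome, window=50, step=10, threshold=0.5):
    """Slide a fixed-size window over the genome and report GC-rich windows.

    Returns a list of (start, end, gc_fraction) tuples, with half-open
    coordinates, for every window whose GC fraction is >= threshold.
    The default parameter values are illustrative assumptions only.
    """
    hits = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        gc = gc_content(chunk)
        if gc >= threshold:
            hits.append((start, start + len(chunk), gc))
    return hits
```

In an AT-rich genome, windows that stand out in such a scan would be candidate ncRNA loci; in practice, overlapping hits would be merged and the threshold calibrated against the genomic background.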
Successful tools such as QRNA [14] and RNAz [15] combine secondary structure prediction with a homology approach relying on multiple alignments. The most comprehensive RNA family database, RFAM [16], uses covariance models combining structural and sequence conservation to establish RNA families. A covariance model can also be used to find new members of existing families, but at considerable computational cost. Dynalign [17] uses an approximation of Sankoff's algorithm for the structural alignment of two RNAs.
Xiao et al. used promoter and terminator prediction in intergenic regions aided by conservation and secondary structure analysis to predict ncRNAs [18].
To achieve better accuracy, some tools limit the scope to specific ncRNA families such as tRNA, miRNA and snoRNA [6].
However, none of the available tools for general ncRNA detection has reached a level of reliability comparable to protein-gene detection software. In contrast to ncRNA genes, protein genes exhibit codon bias, open reading frames and strong sequence conservation, which simplifies their detection. Since the diverse methods for ncRNA detection are complementary, a practical approach is to combine the available methods, as suggested by recent reviews [6,8,19,20]. Meyer et al. also remarked that many ncRNA detection methods rest on the assumption of a significant secondary structure, which may not always be necessary for an ncRNA to function [8]. Consequently, even the more successful methods, which rely on this assumption, need to be complemented with others to achieve more comprehensive predictions.
Combining methods allows either more precise predictions, by keeping only candidates that are predicted by several methods, or more sensitive predictions, by pooling the candidates from all methods. If the combination is done within a well-designed framework, reproducibility, transparency and comparability of predictions are improved as well.
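The two ways of combining predictions described above can be illustrated with a small sketch. It is not taken from our implementation: the function names, the representation of candidates as half-open (start, end) intervals and the `min_methods` parameter are assumptions made for illustration. Setting `min_methods` to 1 pools all candidates, while setting it to the number of methods keeps only regions on which all methods agree.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def combine_predictions(predictions, min_methods):
    """Return regions supported by at least min_methods prediction methods.

    predictions maps a (hypothetical) method name to its list of candidate
    intervals. Each method is merged first, so that overlapping candidates
    from a single method count only once at any position.
    """
    delta = defaultdict(int)
    for intervals in predictions.values():
        for start, end in merge_intervals(intervals):
            delta[start] += 1
            delta[end] -= 1

    regions = []
    depth = 0            # number of methods covering the current position
    region_start = None
    for pos in sorted(delta):
        depth += delta[pos]
        if depth >= min_methods and region_start is None:
            region_start = pos
        elif depth < min_methods and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions
```

Raising `min_methods` trades sensitivity for precision: fewer regions are reported, but each is backed by more independent evidence.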
Previous efforts to integrate data and algorithms in genomic research exist. RNAStructure integrates secondary structure prediction and structure-based homology analysis, but it is not easily extended and not readily usable for genomic scans [21]. Tools such as sRNAfinder [22] combine several approaches to improve prediction results, but in a predefined way. The UCSC genome browser offers a huge amount of experimental data, pre-calculated predictions and analyses for a selected number of genomes [23]. Basic functions for comparative genomics are available, extended by an interface to Galaxy. Galaxy is a project that also aims to overcome custom and redundant scripting for bioinformatics tasks in genomic research, but does not yet offer specialized tools for ncRNA prediction [24]. TAVERNA is a powerful all-purpose framework, but its primary source of functionality, "BioCatalogue", does not yet contain essential ncRNA-related tools such as RNAz and Dynalign [25]. LeARN is an extensible framework for annotating newly sequenced genomes, but it focuses on processing trusted results from detection tools rather than on improving predictions by combining analyses from different algorithms [26]. Consequently, there is a need for a framework that is easy to use and specialized for non-coding RNA detection. The main goals of our project are:
• Combination: Improving ncRNA detection by combining existing methods.
• Comparison: Easy comparison of the prediction performance of different methods must be possible.
• Reproducibility: Application, combination and comparison of methods must be performed in a reproducible and transparent way.
• Usability: User experience should be improved by a GUI and visualization of all workflow steps and their respective results. No programming should be required to construct workflows, and to combine and compare methods.
Our software is aimed at three user groups: First, for bioinformaticians, the use and the combination of integrated tools must be simple. Second, developers of new algorithms for ncRNA detection must be provided with a ready-to-use environment and test bed. This removes the need to re-program solutions for tasks such as parsing files or visualization. Third, biologists must be able to re-use tested methods easily.
The implementation presented here supports compositional analysis, sequence-based homology (BLAST [27]), sequence and structural homology (RNAz [15] and Dynalign [17]) and secondary structure prediction (using RNAfold [28]). Our tool can easily be extended through an open architecture.
In the next section, we show how moses was designed to meet these goals. Case studies then demonstrate the effectiveness of combining methods: depending on how the predictions are combined, either precision or sensitivity is increased. Furthermore, our framework has been successfully applied to guide experiments in Streptococcus pyogenes to find new ncRNAs.