|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Regulatory, non-coding RNAs often function by forming a duplex with other RNAs. It is therefore of interest to predict putative RNA–RNA duplexes in silico on a genome-wide scale. Current computational methods for predicting these interactions range from fast complementary-based searches to those that take intramolecular binding into account. Together these methods constitute a trade-off between speed and accuracy, while leaving room for improvement within the context of genome-wide screens. A fast pre-filtering of putative duplexes would therefore be desirable.
Results: We present RIsearch, an implementation of a simplified Turner energy model for fast computation of hybridization, which significantly reduces runtime while maintaining accuracy. Its time complexity for sequences of lengths m and n is with a much smaller pre-factor than other tools. We show that this energy model is an accurate approximation of the full energy model for near-complementary RNA–RNA duplexes. RIsearch uses a Smith–Waterman-like algorithm using a dinucleotide scoring matrix which approximates the Turner nearest-neighbor energies. We show in benchmarks that we achieve a speed improvement of at least 2.4× compared with RNAplex, the currently fastest method for searching near-complementary regions. RIsearch shows a prediction accuracy similar to RNAplex on two datasets of known bacterial short RNA (sRNA)–messenger RNA (mRNA) and eukaryotic microRNA (miRNA)–mRNA interactions. Using RIsearch as a pre-filter in genome-wide screens reduces the number of binding site candidates reported by miRNA target prediction programs, such as TargetScanS and miRanda, by up to 70%. Likewise, substantial filtering was performed on bacterial RNA–RNA interaction data.
Availability: The source code for RIsearch is available at: http://rth.dk/resources/risearch.
Supplementary information: Supplementary data are available at Bioinformatics online.
Non-coding RNA (ncRNA) form an abundant class of genes involved in both regulation and housekeeping functions, often in complexes with proteins and/or through interactions with other RNAs (Amaral et al., 2008). The potential for ncRNAs is becoming apparent, e.g. in the mammalian genome where the protein coding regions account for ~1.2% (International Human Genome Sequencing Consortium, 2004) while the majority of the genome is transcribed (The ENCODE Project Consortium, 2007). Even in smaller genomes, such as fungi strains of Aspergillus, ‘only’ 45–50% of the genome encodes proteins (Galagan et al., 2005), leaving plenty of room for ncRNAs which could hold the potential for improvements to microbial industrial production, such as has already been investigated in Streptomyces (D’Alia et al., 2010). Also in mammalian production systems such as Chinese hamster ovary cell lines, ncRNAs receive increasing attention (Barron et al., 2011).
Computational screens for structured RNAs result in thousands of candidates on a genome-wide scale and it is of interest to predict possible RNA interaction partners of these sequences (Gorodkin et al., 2010; Gorodkin and Hofacker, 2011). These candidates are predicted from sequence and structure-based alignments, by using a combination of thermodynamic and evolutionary constraints (such as compensating base pair changes) (Pedersen et al., 2006; Torarinsson et al., 2006, 2008; Washietl et al., 2005; Weinberg et al., 2007). A principal problem, however, is to obtain experimental data for each of these candidates, such as the full-length RNA sequence and its function. Another factor is that many ncRNAs are expressed at extremely low levels. For example, the regulatory antisense RNA of the GAL10 operon in yeast is functional and expressed as one copy per 14 cells (Houseley et al., 2008). Coping with such low expressed ncRNAs on a high-throughput experimental scale is still an intractable task.
One approach toward functional analysis of ncRNA candidates is to search for possible interactions with other RNAs, as a substantial class of ncRNAs function by duplex formation with other RNAs, of which microRNAs (miRNAs) are a popular example. However, not only small ncRNAs interact by base pairing, but also long ncRNAs. For example, Staufen 1-mediated messenger RNA decay (SMD) can be initiated by imperfect base pairing between Alu elements in a lncRNA and in the 3′-UTR of an SMD target (Gong and Maquat, 2011). Translational regulation by short RNAs (sRNAs) is also a common mechanism existing in bacteria (Waters and Storz, 2009). A well-known example is the MicC-ompC messenger RNA (mRNA) interaction causing translational repression (Chen et al., 2004; Vogel and Papenfort, 2006).
It is furthermore of interest to use the homology information from multiple structural alignments of ncRNA candidates for interaction prediction. Recent computational approaches that address this are PETcofold (Seemann et al., 2011) and ripalign (Li et al., 2011). Other methods that work on pairs of single sequences include PairFold (Andronescu et al., 2005), RNAup (Mückstein et al., 2006) and inteRNA (Alkan et al., 2006). However, such approaches which take both intra- and intermolecular base pairings into account seem in general to be less suitable for genome-wide interaction screens.
It is therefore relevant to develop fast methods to search for putative RNA–RNA interactions not only for existing ncRNA candidates, but also to find novel RNAs which, for example, hold the potential to be lineage specific or perform lineage-specific functions (Bentwich et al., 2005; Ohhata et al., 2011).
The computational prediction of RNA–RNA interactions is a rapidly growing research area. As discussed in Seemann et al. (2011) , several classes of algorithms with increasing complexity in time and memory were developed. Here, we consider algorithms that use a thermodynamic energy model for predicting intermolecular interactions, while ignoring intramolecular structures. An early method was RNAhybrid (Rehmsmeier et al., 2004) which makes use of the full energy model, only excluding intramolecular base pairings and multiloops. This yields a complexity of in memory and in time, with m and n being the lengths of query and target, respectively, and l the maximum loop length. Hodas and Aalberts (2004) explicitly noted the analogy between Smith–Waterman sequence alignment (Smith and Waterman, 1981) and intermolecular RNA pairing. Their implementation in BINDIGO handles different loop types with different states and dynamic programming (DP) matrices. A recent and faster method is RNAplex (Tafer and Hofacker, 2008; Tafer et al., 2011), which uses a simpler energy model in the first ‘scanning phase’. It approximates larger loops with a linear model, thereby discarding the length-dependent term and thus the factor in time complexity. It also reduces memory consumption to a linear scaling. In a second step, the full energy model can be used on sub-sequences to refine potential binding sites. Even though RNAplex decreases time complexity while maintaining a thermodynamic model, applying it to genome-wide screens is still a computationally demanding task.
Here, we simplify the energy model even further, with the goal of designing a screening method for RNA–RNA interactions that can be used as a pre-filter for computationally complex methods in genome-wide screens. Consequently, our goal here is to search for near-complementary regions. To our knowledge, the fastest tool for such a task is RNAplex which recently was extended to use the information from multiple sequence alignments as well as pre-computed accessibility profiles. However, when searching for near-complementary regions, the energy model of RNAplex can be simplified further. Here, we introduce RIsearch, which implements a simpler state model that essentially takes all stacked base pairs into account, but averages costs over loop openings, internal loops and bulges. This is realized by an alignment-like algorithm that treats dinucleotides as the elementary unit of a 36 × 36 scoring matrix. An additional layer of heuristics can be used for genome-wide searches, which reduces the search space by first identifying short stretches of complementarity and extending those with RIsearch. It is beyond the scope of this work to fully address this additional step. One major consideration here is the G–U wobble, which has been addressed earlier, e.g. in GUUGle (Gerlach and Giegerich, 2006).
The underlying algorithm of RIsearch can be seen as an extension of the Smith–Waterman–Gotoh algorithm (Gotoh, 1982; Smith and Waterman, 1981) for local sequence alignment. To find putative interaction sites, we look for complementarity rather than similarity/identity. A similar idea has been used in the so-called ‘individual base pair model’ introduced by TargetRNA (Tjaden et al., 2006; Tjaden, 2008). The crucial difference is that our scoring scheme is based on dinucleotides instead of single nucleotides. This allows us to reflect the main properties of the nearest-neighbor free energy model, which is widely applied for RNA folding (see below for details on the scoring scheme). It can also be considered as a scoring scheme taking di-residue substitutions into account with gaps being an explicit part of the scoring scheme (Akbasli, 2008). An alignment approach making use of di-residue substitutions, but with other gap scoring was introduced as well (Crooks et al., 2005).
When neglecting intramolecular base pairing, the only structural elements that need to be considered are stacked pairs, bulges and interior loops. Contributions from dangling ends and terminal mismatches are excluded to keep the algorithm simple and fast. We use a three-state model (Fig. 1), with an M-state for consecutive matches (stacked pairs) and mismatches (interior loops) and Bq/Bt states for gaps (bulges) in either sequence (query q or target t). All interactions come from the I(nitiation)-state and terminate in E(nd). All scores S are defined in one scoring matrix, where denotes the energy for stacking the base pair () on (). refers to the ith nucleotide in the query, to the jth nucleotide in the target. Both are indexed in 5′- to 3′-direction from 1 to m, respectively, n. As two RNA strands interact running in opposite directions (antiparallel), target indices are decremented when query indices are incremented.
For each of the main states, we maintain one table (DP matrix) where contains the maximum score of an interaction that ends ungapped at position i of the query and j in the target. Accordingly, Bq and Bt hold maximum scores for sub-alignments ending in a gap.
To lower memory usage from , we split the algorithm in two steps. In the first step, RIsearch scans for possible interaction sites. By approximating loop energies with a linear (affine) model, only two rows of each DP matrix need to be stored. All scores for the current row only depend on the previous row, achieving a linear space requirement. During this phase, we store for each row the maximum score and its position in the query ( and ). This results in a space complexity of with m and n being the lengths of the query and target sequence, respectively. This approach introduces an ambiguity, as each position in the target can only be linked to one position in the query. If there are multiple sites within the query that bind to the same region of the target, the weaker interaction might be missed. In practice, however, this does not cause problems when the query is a short sequence. In the second step, all entries in E that exceed a given threshold are processed. For this, a region of 40 nt (or a user-specified amount) downstream of the identified positions is taken into account to compute the actual structure and free energy of the duplexes. In this way, RIsearch needs only marginally more time to identify suboptimal interactions.
This approach is similar to RNAplex. We reach a further simplification by (i) not using an extra state for interior loops, and (ii) also approximate small interior loops with the affine model instead of relying on the look-up tables. (iii) RNAplex seems to incorporate dangling end contributions even though this is not stated in their paper. Fewer states lead to a less complex recursion and other differences are due to algorithmic design (most notably our dinucleotide matrix). This also holds true for a comparison with BINDIGO that distinguishes different types of bulges and interior loops depending on their size and degree of asymmetry and includes terminal stacks, thus leading to a more complex recursion.
The RIsearch recursion is given as
Entries in the M-state (ungapped) can come from (i) the M-state by extending the previous alignment with one residue on both strands, meaning either a stacking of a new pair, or the symmetric extension of an interior loop (or opening, closing an interior loop). The entries can (ii/iii) come from a dash (gap) in either sequence, reflecting the closing of a bulge or the closing or continuation of an asymmetric interior loop. (iv) A new alignment can be started and (v) 0 is given if this pair should not be part of the interaction. In the implementation, the latter two cases are merged into one, implicitly handled by the scoring matrix. All scores where and do not form a Watson–Crick or wobble base pair, and therefore should not start an alignment, are set to zero. Elements in Bq and Bt can either come from the M-state by opening a new ‘gap’ or from the same state by extending an existing ‘gap’. As mentioned before, the row maximum is stored in a one-dimensional array and the corresponding position i within the query in . One possibility to allow a position in the target to be related to more than one position in the query is to make these arrays two-dimensional, giving room to store the second and third best interaction in and .
Values for the 36 × 36 scoring matrix (Supplementary Fig. S1) are derived from the Nearest Neighbor Database (NNDB) (Turner and Mathews, 2010). The NNDB contains parameters, determined from optical melting experiments, that allow prediction of free energy changes of the different RNA structural elements (stacked pairs, loops) and are widely used in software for RNA folding. It provides complete nearest-neighbor sets, including rules and parameter values, along with tutorials. Considering the current and previous position of the two sequences allows us to apply stacking energies for the base pairs. Bulge and interior loop energies usually contain a length-dependent term, but are here approximated by an affine model. Figure 2a shows that the affine ‘gap’ model is exact for bulge sizes 2–6, and over-penalizes larger loops. A single-nucleotide bulge in our model receives the bulge opening cost and possibly a penalty for terminal A–U or G–U pairs. This is not the case for the full energy model, in which the stacking energy of the enclosing base pair is awarded. The look-up tables for interior loops of sizes 1 × 1, 1 × 2 and 2 × 2 cannot be incorporated into our scoring scheme. Instead, all energies have to be approximated by an affine model with opening and extension penalties, as depicted in Figure 2b. Interior loops with >14 nucleotides are over-penalized.
We created two matrices, one based on the so-called Turner 1999 energy parameters (Mathews et al., 1999) and one based on the Turner 2004 set (Mathews et al., 2004). The latter is the default, but the user can choose either. Energy contributions of stacking and bulges are largely identical, the scoring differs for interior loops. Supplementary Table S1 shows the free energies of example duplexes as modeled by RIsearch and other methods.
There are some ambiguous cases in the scoring matrix, for example in . We cannot tell whether it is the extension of a bulge (0.4 kcal/mol) or the asymmetric extension of an already asymmetric interior loop (0.6 in T04 and 0.48 in T99). For this case, we decided to just assign the bulge loop extension penalty, as larger loops are already over-penalized.
To favor short stable interactions, the user can choose a per-nucleotide penalty [like in Tafer and Hofacker (2008)], which is then directly integrated in the scoring matrix.
We benchmarked our method on several datasets containing simulated data and two real-life datasets of bacterial sRNA and human miRNA with their respective targets. Accuracy and runtime of RIsearch were compared with previous methods. We mainly focused on a comparison with RNAplex, because it belongs to the same class of algorithms and has already been benchmarked against a variety of other tools.
This dataset consists of random sequences of different lengths (20–50 nt, in steps of 5 nt) covering a variety of GC-contents (10, 30, 50, 70 and 90%). For each class, 1000 sequences were generated. First, the perfect complement was derived. Then, this hypothetical optimal binding partner was mutated stepwise as follows: A random position in the sequence was chosen to be substituted with 75% probability, deleted with 17% or a nucleotide inserted (8%). The number of repetitions is length dependent to ensure a wide range of Levenshtein distances (LDs) between the optimal and the mutated target, yielding duplexes for each combination of length and GC-content.
For each of these duplexes, we calculate the minimum free energy (MFE) with different tools, namely DuplexFold from the RNAstructure package (Reuter and Mathews, 2010), which implements the NNDB 2004 rule set, RNAplex (Tafer et al., 2011) and RNAcofold (Bernhart et al., 2006b), which both use the Turner 1999 parameters and RIsearch with the 2004 matrix.
To benchmark time (and also memory) consumption, we use three different sets of sequences. Random data as well as genomic sequences of various lengths and different number of query sequences are contained in those. For details, see Section S2.1 in the Supplementary Material.
This dataset comprises a total of 17 sRNA–mRNA interactions with experimentally verified binding site positions. It has been used before as a benchmark set by Busch et al. (2008), Chitsaz et al. (2009) and lately by Tafer et al. (2011) , from which sequences were taken. Query and target sequences are on average 147 nt and 179 nt long. The specific sRNA–mRNA interactions appear in the table in Section 3. Since RNAplex was previously benchmarked on a range of tools (Tafer and Hofacker, 2008; Tafer et al., 2011), we only compare RIsearch (both parameter sets) with RNAplex (with and without taking into account accessibility). Accessibility profiles were computed by RNAplfold (Bernhart et al., 2006a) with parameters as suggested by Tafer (RNAplfold -W 240 -L 160 -u 40 -O.)
For this benchmark we do not consider suboptimal duplexes initially, but only whether the first reported hit corresponds to the experimentally verified interaction site. We present the deviation between predicted and experimentally verified duplex boundaries as was also done by Tafer et al. (2011). To evaluate the performance, we additionally calculated the sensitivity (SEN, also called true-positive rate) and the precision [also called positive predictive value (PPV)] and the harmonic mean of both (also known as F-measure) as was also done by Kato et al. (2010) and Salari et al. (2010). The definitions are
We counted true positives (TPs), false positives (FPs) and false negatives (FNs) by comparing the verified interacting base pairs with predicted ones. An alternative measure is the Matthews correlation coefficient (Matthews, 1975) , which in this case [since the number of true negatives (TNs) is orders of magnitudes higher than TP, FP and FN] reduces to the geometric mean of the sensitivity and the PPV: (Gorodkin et al., 2001). For cases where the first prediction did not overlap the experimentally verified location, we then also computed suboptimal solutions.
We examine a subset of human microRNAs causing mRNA repression from TarBase (Papadopoulos et al., 2009). This dataset of 27 interactions has been used before to benchmark RNAplex (Tafer and Hofacker, 2008). The mature miRNA sequences from miRBase 17 (Kozomara and Griffiths-Jones, 2011) are used as query and the 3′-UTRs of the respective mRNAs from UCSC hg19 (average length 2294 nt) are used as target. The experimentally confirmed binding sites were collected from the original papers and mapped to the extracted sequences. For some miRNA–mRNA pairs, there is more than one verified interaction site. The majority have only one or two binding sites (13 and 9), few have up to four, the interactions including KRAS and NRAS forming the exception with eight and nine possible binding sites, respectively (Johnson et al., 2005). The 27 miRNA/UTR pairs were scanned with RNAhybrid, RNAplex and RIsearch, allowing for suboptimal hits. For RIsearch, we used the 1999 energy parameters, as the other two methods also use them. An interaction counts as ‘recovered’ when the predicted target region overlaps any of the annotated binding sites. For interactions with more than one experimentally verified binding site, we considered the one ranked highest (according to their predicted free energy) for each method.
We used the same 27 interactions to test performance on large-scale screens. Instead of only the 3′-UTR sequence, we used the whole repeat-masked chromosome where the known target is located. We also included GUUGle as well as TargetScanS (Garcia et al., 2011) and miRanda (Enright et al., 2003), which both are specifically designed for miRNA target prediction. In this screen, we excluded RNAhybrid, because it requires to fill in the entire DP matrix several times to predict suboptimal duplexes, which makes it too slow for scanning whole chromosomes. For each of the five methods, we count the number of hits which have the same or better score than the best hit that overlaps an annotated target site. Two different measures were used to evaluate the methods. For the first measure, we ranked all methods individually for each miRNA–mRNA pair, where the method yielding the lowest hit count ranks first (using fractional ranking). Given k interactions, let be the count and the rank of method g in the ith interaction. The rank product (Breitling et al., 2004) is given as the geometric mean: . Because this is an ordinal measurement, we define a second measure, the relative hit score, which takes account for the degree of difference between the hit counts of the different methods. We first find the maximum count N for each interaction: and define .
To demonstrate the efficacy of RIsearch as a filter, we first use RIsearch (with free energy threshold of −11 kcal/mol) and apply TargetScanS and miRanda on the pre-filtered data. We measure the reduction in candidate regions compared with the raw results of the two miRNA target predictors. We compare this with the filter abilities of GUUGle (requiring a seed match of at least seven nucleotides) and a combination of the two.
To evaluate the accuracy of the scoring scheme, we created a set of random sequences for different combinations of length and GC-content as described above. We address not only score (energy) computations, but also the ranking by the various tools. In Figure 3a, we show how the different methods deviate in their scoring. RNAcofold was included here because it considers intramolecular base pairs resulting in lower energies for sequences that are further away from perfect complementarity. The energy deviation to the other tools increases with LD. RIsearch typically yields higher energies; however, the results are ranked similarly to the other methods. When comparing duplex free energies as computed by RIsearch and DuplexFold (Fig. 3b), we get a Pearson product-moment correlation coefficient r of 0.99 (P-value 2.2e−16). The Spearman’s rank correlation coefficient ranges between 0.98 and 0.99. Furthermore, when looking at the 5% highest ranking duplexes (those with the lowest G), the overlap in candidates is quite substantial (Fig. 3c). Even though the free energies computed by RIsearch deviate from energies computed by a method using the full energy model (see comparison with DuplexFold in Fig. 3d), we have shown that the general ranking of the duplexes correlates well with the ranking by DuplexFold. See Supplementary Figure S2 for sequences of other lengths and GC-content.
As RNAplex has already been shown to be much faster than alternative methods, we only compare the performance of RIsearch to RNAplex on different datasets.
On a small dataset of 19 bacterial sRNAs and 100 target sequences (each 1200 nt long), RNAplex takes around 25 s to predict all optimal duplexes, whereas RIsearch only needs 9 s on a standard laptop (Intel C2D @2.53 GHz) (Supplementary Table S2).
To prove that whole-genome scans become feasible, we queried whole human chromosome 1 with one miRNA and got a speedup of 181-fold. With more sequences in the query, this drops drastically. From the data shown in Supplementary Table S3, it seems that RNAplex uses a substantial amount of time for the initialization. But also after correcting for that, RIsearch still shows a significant speedup.
To get a more complete picture, we generated random sequences with lengths of 10, 100, 1000 and 25 nt (the latter to represent ncRNAs of the type miRNA or small interfering RNA) as queries as well as target sequences in order of magnitude steps between 1000 and 1 GB. The speedup of RIsearch over RNAplex grows with decreasing query and increasing target lengths (Supplementary Table S4). The extreme speedups we see for large target sequences should be noted with caution, as they reflect the same initialization issue as mentioned above. This overhead in the initialization cannot be explained with more advanced options to RNAplex, parameters were chosen to yield fastest runtimes. The overhead might just be an implementation issue.
Overall, we observe a worst case speedup of around 2.4. Peak-memory consumption is typically reduced by a factor 1.44, i.e. RIsearch uses 69% of the memory that RNAplex uses. With short target sequences (1000 nt), this drops to 43%.
For all of these benchmarks the simple version of RNAplex was used, i.e. not taking into account accessibility. We also tested the version including accessibility profiles and found it substantially more resource demanding than the regular RNAplex, which itself in its current implementation is considerably slower than RIsearch (Supplementary Section S2.2).
The precision of RIsearch (with 99 and 04 Turner parameters) was compared with RNAplex (with and without accessibility) on a real-life dataset of 17 bacterial sRNA–mRNA interactions. Although RNAplex-a (with accessibility) predicts 16 interactions that overlap the known binding sites, RIsearch with both energy parameter sets and RNAplex-c (without accessibility) each recover 12 of the known interaction sites (Table 1). There are two cases (GcvB-STM4351 and MicC-ompC) where each of those three methods predicts the same energetically more stable interaction than the annotated one. When measuring the amount of overlap of predicted and annotated base pairs (Supplementary Table S5), RNAplex-a performs best in all measures, with an average sensitivity and PPV (0.787 and 0.736) higher than RIsearch04 (0.656 and 0.641) and the other methods (ranging behind), because it only misses one interaction, not five. However, if suboptimal solutions are additionally taken into account, RIsearch04 has a better average sensitivity of 0.919, compared with 0.846 and 0.917 for RNAplex with and without accessibility, respectively. RIsearch (with its default scoring matrix) then also outperforms RNAplex in terms of PPV (RIsearch04: 0.898, RNAplex-a: 0.785 and RNAplex-c: 0.821), F-measure and Matthews correlation coefficient. In all measurements, the 2004 Turner parameters lead to a higher prediction accuracy than the 1999 parameters in RIsearch.
We also compared the recovery rates of RIsearch, RNAplex and RNAhybrid for 27 human miRNA–target UTR duplexes. Predicted binding sites were ranked according to their free energy and we report the highest ranking prediction that overlaps an annotated binding site in Table 2, as it has been done by Tafer and Hofacker (2008). For 23 of the interactions, the three methods report the same binding site as the highest scoring true-positive hit. For example, TarBase contains four possible binding sites for the let-7e miRNA in the 3′-UTR of SMC1A mRNA. They constitute the top ranking candidates of all the methods tested here, but in different order. The interaction that has been experimentally verified (Kiriakidou et al., 2004) is ranked first by RIsearch and second by the other methods. In this benchmark, RNAhybrid performs best, known target sites are ranked higher and predicted more accurately in position. RNAplex is slightly better than RIsearch in ranking the real interactions (on average 1.92 compared with 2.07), while RIsearch predictions are generally closer to the verified binding site. The frequent one-nucleotide deviation in position could be an artifact of the different handling of dangling ends and terminal mismatches.
When screening the whole chromosome, we observe very different levels of specificity (Supplementary Table S6). TargetScanS fails to find 2 out of the 27 interactions, because it requires a perfect seed match that is not present in those target sites. With default parameters, miRanda misses three interactions. We have not tried any other parameter setting. GUUGle alone performs worst (in terms of RP and RHS), but when combined with RIsearch or RNAplex, respectively, we get an miRNA target predictor comparable to the specialized methods. In this combination, RIsearch scores best in RHS and slightly behind the two specialized methods in RP.
When applying this combination as pre-filter for TargetScanS and miRanda, we achieve on average a reduction of candidates by around 37% for both tools. RIsearch without the GUUGle-prefilter accounts for an average reduction of 27% for miRanda candidates and 35% for TargetScanS, in some cases of up to 70% (Supplementary Table S7). The degree of reduction seems dependent on the GC-content of the miRNA (see Supplementary Material, page 11).
The prediction of thousands of potential miRNA targets is in agreement with a recent hypothesis of miRNA response elements connecting mRNAs, transcribed pseudogenes and long ncRNAs in large-scale regulatory networks (Salmena et al., 2011).
Here, we illustrate how RIsearch can be used as filter for the more complex algorithms, such as IntaRNA or RNAup. We extracted the sequences from Escherichia coli and Salmonella typhimurium according to the IntaRNA paper (Busch et al., 2008). In Figure 4, we show a receiver-operating characteristic curve of the recall against search space reduction given different energy cutoffs, with an area under the curve of 0.817. It shows that with a rather conservative cutoff of −10 kcal/mol, we can already filter out 16.1% of the candidates. With −11.5 kcal/mol, we still retain all true targets, while reducing the search space by 27.5%. Considering that IntaRNA uses around 9 h 40 min (and RNAup even 18 h 50 min) to compute all 47.726 duplexes, while RIsearch takes <18 s, this shows how genome-wide searches can be speeded up.
RIsearch is a fast method to search for near-complementary base pairing in genomic sequence by using a simplified energy model. In a runtime benchmark, we show that RIsearch is at a minimum a factor of 2.4 faster than RNAplex, the currently fastest method for predicting near-complementary duplexes and also has a lower memory consumption. Remarkably, our simplified model gives good energy estimates for complementary duplexes interspersed with small bulges and interior loops. Even though RIsearch systematically computes an energy differing by a small factor compared with the full Turner energy model, the reported energies strongly correlate (r = 0.99) with the energies computed by the full model in DuplexFold. When ranking random duplexes by their predicted energies, RIsearch shows on average an overlap of 94% with DuplexFold and RNAplex within the highest ranking duplexes.
In our evaluation of prediction accuracy on the sRNA–mRNA and miRNA–mRNA datasets, we show that RIsearch achieves a similar or better sensitivity and precision for predicted base pairs as other compared methods. However, considering the accessibility of binding sites with RNAplex can increase the recovery rate of the verified sRNA–mRNA interactions. Other approaches, such as IntaRNA and RNAup, also account for accessibility by computing intramolecular base pair probabilities in both sequences. This, however, comes at the expense of runtime. In contrast, the objective of RIsearch is the fast search for potential RNA–RNA duplexes. One application is pre-filtering in genome-wide screens. When RIsearch is used as a pre-filter for specialized miRNA target predictors, such as miRanda and TargetScanS, the number of target site candidates can be significantly reduced, which in turn results in a better precision of the miRNA target prediction.
Problems with developing methods that should be applied on a genome-wide scale include reliable testing. We face a lack of experimentally verified interactions. For many miRNAs for example, target genes have been identified, but the actual binding site positions within their 3′-UTR are unknown (Lindow and Gorodkin, 2007). Even for known target sites, the extent of the interactions is not clear, for examples of bacterial sRNA target sites, see Sharma et al. (2007). Even though we could benchmark on a limited dataset, benchmarking the accuracy of genome-wide searches for RNA–RNA interactions is currently hard given the limited amount of known interactions. In particular, it is not possible to reliably calculate the false-discovery rate unless follow-up experiments are carried out. Another factor is the estimation of P-values, which requires a background model for RNA–RNA interaction to distinguish true positives from random hits. This, however, depends on incorporating reliable shuffling schemes, e.g. based on dinucleotide composition, similar to those for de novo prediction of ncRNA genes (Gorodkin and Hofacker, 2011).
RIsearch was developed as a tool, which can conduct a fast initial screen in genomic sequence and aid in the overall goal of mapping all potential RNA–RNA interactions in, e.g. the human genome. However, it is beyond the scope of this work to set up such a pipeline, which most likely involves additional methods, taking the full energy model into account, as well as the development of a framework for computing P-values. The estimation of P-values by TargetRNA points to a direction for this. For further future directions, prediction of RNA–RNA interactions could be combined with high-throughput experimental data, such as done for RNA structure prediction (Deigan et al., 2009; Kertesz et al., 2010; Underwood et al., 2010).
The introduced simplifications will make a hardware implementation of the algorithm, e.g. with a field-programmable gate array, more feasible. Hardware accelerated versions of the Smith–Waterman algorithm have been shown to be magnitudes faster than traditional software implementations (Li et al., 2007).
The authors thank Peter Sestoft, who was involved in the development of the state model and the initial alignment version of the algorithm. They further thank Larry Croft for critical reading of this article, and the anonymous reviewers for their valuable comments and suggestions.
Funding: Danish Council for Independent Research (Technology and Production Sciences); The Danish Council for Strategic Research (Programme Commission on Strategic Growth Technologies) as well as the Danish Center for Scientific Computing.
Conflict of Interest: none declared.