|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Many RNA molecules function without being translated into proteins, and function depends on structure. Pseudoknots are motifs in RNA secondary structures that are difficult to predict but are also often functionally important.
Results: TurboKnot is a new algorithm for predicting the secondary structure, including pseudoknotted pairs, conserved across multiple sequences. TurboKnot finds 81.6% of all known base pairs in the systems tested, and 75.6% of predicted pairs were found in the known structures. Pseudoknots are found with half or better of the false-positive rate of previous methods.
Availability: The program is available for download under an open-source license as part of the RNAstructure package at: http://rna.urmc.rochester.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
In many cases, RNA functions in cells without being translated into proteins. These non-coding RNAs (ncRNAs) carry out a wide range of functions, including catalysis, intracellular protein trafficking, immunity and gene regulation (Doudna and Cech, 2002; Gesteland et al., 2005; Marraffini and Sontheimer, 2010; Nissen et al., 2000; Tucker and Breaker, 2005; Walter and Blobel, 1982; Wu and Belasco, 2008). The functions of these ncRNAs depend on their structures.
RNA structure is hierarchical, beginning with the sequence of nucleotides (the primary structure), then the set of all canonical, i.e. GC, AU and GU base pairs (the secondary structure), and ultimately the full 3D positioning of all the atoms in the molecule (the tertiary structure). The secondary structure tends to form on a faster time scale and with stronger interactions than those interactions that fold the tertiary structure, so the secondary structure can usually be predicted independently of knowing the overall tertiary structure (Tinoco and Bustamante, 1999).
The most accurate method of secondary structure determination is comparative sequence analysis. It relies on the hypothesis that sequences that serve the same function share the same structure. But comparative sequence analysis is impractical in many situations because it requires significant human insight and a large number of diverse sequences (Gutell et al., 2002; Woese and Pace, 1993). Commonly, the secondary structure for a single sequence is predicted using free energy minimization. These methods use a set of thermodynamic parameters that predict the folding free energy change of a given structure and a dynamic programming algorithm to find the structure with lowest free energy change (Mathews and Turner, 2006; Mathews et al., 2004).
The thermodynamic parameters can be used with a different type of dynamic programming algorithm to predict a partition function, which can be used to predict the base pairing probability for all possible canonical pairs (McCaskill, 1990). Base pair probabilities can be used to assemble structures called maximum expected accuracy (MEA) structures (Do et al., 2006; Hamada et al., 2009; Knudsen and Hein, 1999; Lu et al., 2009). Briefly, MEA prediction seeks to maximize the sum of probabilities of each base pair for all paired bases and the probabilities of being single-stranded for all unpaired bases. Empirically, MEA structure prediction works because that base pairs predicted to be the most probable are the most likely to be in the known structure (Mathews, 2004).
Although comparative sequence analysis is not yet automated, a number of methods have been developed to improve structure prediction by using the information available in homologous sequences. (Bernhart et al., 2008; Harmanci et al., 2007, 2008; Mathews and Turner, 2002; Steffen et al., 2006; Will et al., 2007; Xu et al., 2007). These methods use a supplied input sequence alignment, predict the structure for each sequence and then find the common structure, or align and fold the sequences simultaneously. These programs have been reviewed previously (Bernhart et al., 2008).
The recently published TurboFold algorithm is one such method that improves secondary structure prediction by predicting a conserved structure using two or more homologous sequences (Harmanci et al., 2011). It refines base pairing probabilities of an arbitrary number of sequences using their probabilistic alignments in an iterative manner. Structures are then assembled using a MEA algorithm from the final set of base pairing probabilities.
Because of computational complexity, most programs that predict secondary structure do not predict pseudoknots. Formally, a pseudoknot occurs if an RNA has two base pairs, one between bases i and j, and a second between i′ and j′, such that i< i′ <j< j′. Pseudoknots make up only a small fraction of base pairs, but they are often found in ncRNAs. Predicting the minimum free energy secondary structure that includes pseudoknots of any topology has been proven to be NP-hard (Lyngsø and Pederson, 2000). In spite of this, a number of single-sequence methods have been developed to predict structures with pseudoknots (Akutsu, 2000; Condon et al., 2004; Dirks and Pierce, 2003; Reeder and Giegerich, 2004; Rivas and Eddy, 1999; Uemura et al., 1999). These methods limit the set of possible pseudoknot topologies to accelerate the search. A number of available methods has been previously reviewed (Liu et al., 2010).
Recently, a new algorithm was published, ProbKnot, that uses a single sequence partition function to predict pairing probabilities and assigns base pairs if the two bases are mutually maximally probable to pair with one another (Bellaousov and Mathews, 2010). This algorithm can find pseudoknots of any topology. While the partition function does not include terms for pseudoknotted structures, the ProbKnot algorithm was able to find pseudoknotted base pairs with reasonable accuracy compared with other methods, and at minimal computational cost: the algorithm is O(N2) in addition to the O(N3) partition function calculation.
Approaches to the pseudoknot prediction problem that employ information from sequence and structural homology are few in number. Like single sequence methods (Abrahams et al., 1990; Isambert and Siggia, 2000), a Monte Carlo alignment and folding method has also been published (Meyer and Miklos, 2007). Also analogous to single sequence methods (Dawson et al., 2007; Jabbari et al., 2008; Ren et al., 2005), another approach identifies common pseudoknot-free structures and then iteratively matches regions that can form pseudoknots (Ruan et al., 2004). A third approach computes a score matrix based on alignment and thermodynamic information and uses a maximum weight matching (MWM) algorithm to assemble an optimal secondary structure (Tabaska et al., 1998; Witwer et al., 2004). The latter two approaches are dependent on a high-quality sequence alignment as input, either one assembled manually or one computed from sequences with a large degree of sequence identity (70% or more). Such alignments are not always available, particularly for newly discovered classes of RNAs.
This contribution reports a new algorithm, TurboKnot, for predicting RNA secondary structures conserved in multiple sequences, including pseudoknots. TurboKnot assembles structures in the same way as ProbKnot, but with the pairing probabilities predicted by a TurboFold calculation using multiple sequences to inform the pairing probabilities. The additional computational cost compared with TurboFold is small compared with the TurboFold calculation itself, just O(MN2), where M is the number of sequences used and N is the length of the longest sequence. It is benchmarked against two algorithms capable of predicting pseudoknots that also use multiple sequences as input, ILM and Hxmatch (Ruan et al., 2004; Witwer et al., 2004). It is also benchmarked against the single sequence ProbKnot algorithm (Bellaousov and Mathews, 2010). Finally, two algorithms that cannot predict pseudoknots are included in the benchmark, TurboFold (Harmanci et al., 2011) and MEA (MaxExpect) (Lu et al., 2009).
For TurboKnot, base pair probabilities were predicted using TurboFold (Harmanci et al., 2011). This algorithm employs the nearest neighbor thermodynamic parameters (Mathews et al., 1999, 2004; Xia et al., 1998). The per-branching-helix bonus parameter, however, was left at -0.6 kcal/mol, consistent with optical melting data (Diamond et al., 2001), rather than using the optimized parameter of Mathews et al. (2004). TurboFold was run using four iterations instead of the default of three for an increase in accuracy at the expense of computational cost, but otherwise was run with the default parameters.
Structures were constructed using the same method as the ProbKnot algorithm. First, maximal pairing probabilities for each base i were identified from the TurboFold output and stored in an array Pmax(i). Next, all pairing probabilities for all base pairs, P(i,j) were checked, and if P(i,j) is equal to Pmax(i) and Pmax(j), a base pair was assigned between bases i and j. This process can be iterated, where subsequent iterations only examine the bases that remained unpaired from previous iterations, but subsequent iterations were not found to increase the accuracy of the results (data not shown). Finally, all base pairs involved in helices shorter in length than 3 bp were removed (helices were not considered to be interrupted by single bulges or 1 × 1 internal loops). This structure prediction process was performed independently for each sequence run in a single TurboFold calculation.
MWM calculations were performed with Wmatch from MATHPROG (http://elib.zib.de/pub/Packages/mathprog/matching/weighted/) (Micali and Vazirani, 1980). Edges were only supplied to the algorithm for consideration if their computed pairing probability was ≥10−6. As with TurboKnot, all helices shorter than 3 bp were removed from the structure.
Sensitivity and positive predictive value (PPV) were used to evaluate each algorithm tested. Sensitivity measures the fraction of pairs in each known structure found:
PPV measures the fraction of pairs predicted that are correct:
Because it is difficult to determine the exact register of pairs by comparative analysis and because pairing can fluctuate because of thermal motions, predicted base pairs between i and j were counted as correct if i was paired with j, j − 1, or j + 1, or if j was paired with i − 1 or i + 1. Assessment using only exact matching base pairs is supplied in Supplementary Tables S1 and S2.
All sequences were tested on a randomly chosen subset (with replacement) of seven different subtypes of RNA: 5S RNA, tRNA, RNase P RNA, SRP RNA, tmRNA, telomerase RNA and group I introns (Brown, 1998; Chen et al., 2000; Damberger and Gutell, 1994; Sprinzl et al., 1998; Szymanski et al., 1998; Waring and Davies, 1984; Zwieb and Wower, 2000). For all multisequence methods, five RNAs of each type were selected to be computed together in each calculation. Particular RNA sequences could be chosen more than once as a part of different calculations. These were considered to be separate results, as multiple sequence methods may give different results for the same sequence depending on the other sequences used in the calculation. For consistency, duplicate sequences were also counted as independent calculations when using single-sequence methods to predict structures, even though the resulting structures would be identical. Average values of sensitivity and PPV were computed for each class of RNAs. The overall average is the mean of the individual averages on sequence types.
Pseudoknots were identified using the algorithm of Smit et al. (2008) to identify the fewest pairs that could be removed to remove the pseudoknot from the structure. Predicted pseudoknotted base pairs were only counted as correct if the i−j pair was in the accepted structure (allowing thermal fluctuation as in Section 2.4), the pair was counted as a pseudoknot by the Smit et al. algorithm, and the pair was pseudoknotted. A pair was considered pseudoknotted if it met the i< i′ <j< j′ criterion, with at least one other pair i′ − j′ that was also a correct pair compared with the accepted structure (Bellaousov and Mathews, 2010).
Over the range of RNA families tested, which have a variety of functions and structures, TurboKnot averaged a sensitivity of 79.8% (Table 1) and a PPV of 72.9% (Table 2). For comparison, TurboKnot was benchmarked against two other multisequence methods that can predict pseudoknots: ILM version 1.0 (Ruan et al., 2004) and Hxmatch version 1.2.1 (Witwer et al., 2004). Each was run with default parameters. Sequences were aligned for input into these algorithms using MUSCLE version 3.6 (Edgar, 2004), which performs well for RNA sequence alignment compared with other available tools (Wilm et al., 2006). In addition to these two methods, benchmarks were run using ProbKnot (Bellaousov and Mathews, 2010), a single sequence method that can predict pseudoknots, TurboFold (Harmanci et al., 2011), the underlying multisequence method that cannot predict pseudoknots and MaxExpect (Lu et al., 2009), a single sequence method that cannot predict pseudoknots. Other single-sequence methods capable of predicting pseudoknots were recently benchmarked (Bellaousov and Mathews, 2010), and ProbKnot compares favorably against these. It is used here as a representative of this class of algorithms. One algorithm, DotKnot, has been published more recently. Its performance on this test set is included in Supplementary Tables S3–S5 for comparison. Similarly, multisequence algorithms that cannot predict pseudoknots were recently benchmarked (Harmanci et al., 2011), and TurboFold compares favorable and is used as a representative of them here.
The average sensitivity and PPV of TurboKnot were better than any other method capable of predicting pseudoknots. Additionally, TurboKnot had a higher sensitivity than TurboFold, albeit at a trade-off in PPV. Since TurboKnot considers a larger structure space than TurboFold, it is natural to find more correct base pairs with a cost to the PPV.
For pseudoknots in particular, TurboKnot found 289 correct pseudoknotted base pairs out of a total of 5897 in the known structures, and it made 672 false positive predictions of pseudoknotted pairs (Table 3). This is more true positives and fewer false positives than any other tested method. When considering structures with pseudoknots rather than individual pairs, TurboKnot found at least one true positive pseudoknot in 59 structures out of 239 predictions, and out of 426 tested structures that contain at least one pseudoknot (Supplementary Table S6). This is again more true positives and fewer false positives than any other tested method.
The use of an MWM algorithm to assemble structures given the TurboFold pair probabilities was also considered as a control (Supplementary Tables S7 and S8). This approach resulted in the identification of the exact same set of correct base pairs as those based on mutual maximum probability pairs, plus a small number of extra incorrect pairs. The additional, incorrect pairs were so few that they only affected the second decimal places of the PPV percentages. The exception was the tmRNA family, in which a small number of additional correct base pairs were identified. The minor difference between these two approaches was due to how well the TurboFold algorithm refines the pairing probabilities by excluding structures that are not conserved.
As another control, an MWM algorithm was used to predict structures from pair probabilities calculated using a single sequence partition function. This added a number of spurious base pairs (Supplementary Table S9). This emphasized the importance of the ProbKnot approach when single sequences are used.
Time benchmarks were performed on all tested algorithms. Groups of five random sequences were used, with average lengths varying from 80.0 nt to 458.8 nt (Table 4). The Hxmatch algorithm performed the fastest. TurboKnot was significantly slower, but it could still carry out the computations in a reasonable amount of time for a user with a typical desktop computer.
Figure 1 shows a sample structure prediction for Mycobacterium tuberculosis RNase P using TurboKnot, ProbKnot or TurboFold (Brown, 1998). The TurboKnot and TurboFold structures are similar, but the TurboKnot algorithm is able to identify most of the pseudoknotted base pairs. ProbKnot has a lower sensitivity and PPV overall for non-pseudoknotted base pairs in this structure. It identifies the four correct pseudoknotted base pairs that TurboKnot missed, but it also finds eight pseudoknotted base pairs that are not in the accepted structure.
TurboKnot overextends some helices as compared with TurboFold when there are valid bases available to pair. The accuracy of the extra base pairs added when TurboKnot extends helices compared to TurboFold is low; of the base pairs on ends of helices predicted by TurboKnot and not TurboFold, 27.6% were correct in all the structures. This is for cases where a helix is in common in the predictions of TurboKnot and TurboFold. Single nucleotide bulges are not considered to break a helix. These helices that are extended by 1 bp on an end are sometimes correct, so they do increase the sensitivity score of TurboKnot compared with TurboFold but at the expense of PPV.
In the same way as the ProbKnot algorithm, TurboKnot is able to assemble structures containing pseudoknots using base pair probabilities. Using probabilities from TurboFold that are functions of structural alignment and homology information in addition to thermodynamic stability significantly improves predictions when considering all base pairs and when considering just pseudoknots. Compared with TurboKnot, some methods were able to find more true pseudoknotted pairs on some systems. For example, Hxmatch found more true pseudoknotted pairs in telomerase RNA, but in the process it predicted more than twice as many false positive pairs as there were true pseudoknotted pairs in the accepted structures.
Single sequence methods, with or without pseudoknot-prediction capabilities, were reported as having poor performance on the prediction of tmRNA structure (Bellaousov and Mathews, 2010). The results with the sequences used here are consistent with that result. TurboKnot (and TurboFold), however, show significant improvement for this difficult class of RNAs. TurboKnot shows its best pseudoknot-prediction capabilities on this class of RNAs, with 49% of all predicted pseudoknotted base pairs being true positives.
In contrast, Group I introns show slightly worse sensitivity in TurboFold and TurboKnot compared with MaxExpect and ProbKnot, although they did still have higher PPVs. This is likely because the long insertions present in some Group I introns makes accurate sequence alignment difficult. Indeed, ILM and Hxmatch, which are dependent on a sequence alignment algorithm for input, did not perform well with randomly chosen groups of sequences from this class of RNAs. These algorithms may have more success, however, when sequences with higher sequence similarity are used or when using a manually curated alignment, which would allow for more accurate sequence alignment. Such groups of sequences are not always available.
ILM was benchmarked in a previous study using just one sequence rather than an alignment, and, unexpectedly, it appears less accurate when using multiple sequences as input compared with previously reported benchmarks (Bellaousov and Mathews, 2010; Ruan et al., 2004). This is likely because it was hindered by errors made by the sequence alignment algorithm, which aligns sequences but not secondary structure (Edgar, 2004). It fared much better when given manually curated alignments based on structural homology (Ruan et al., 2004). Because Hxmatch, which was given the same input sequence alignments, had similar results on similar classes of RNAs, it is likely tools that consider both secondary structure and alignment at the same time are necessary to accurately align functionally similar RNAs which are not chosen to have a relatively high sequence identity.
While TurboKnot has significantly improved pseudoknot prediction accuracy, more work is needed to raise the accuracy to the current standard for non-pseudoknotted base pairs. Dirks and Pierce (2004) report an algorithm that computes a partition function that includes pseudoknots of some limited topologies in O(N4) time. This partition function could be used by TurboFold and TurboKnot to consider a larger structure space, ideally increasing the probability of true pseudoknots at the expense of false ones that are predicted currently at an acceptable increase in computational expense.
The authors thank A.O. Harmanci and G. Sharma for discussions and S. Bellaousov for performing MWM calculations. Computer time was provided by the University of Rochester Center for Research Computing.
Funding: National Institutes of Health (grant number R01HG004002 to D.H.M).
Conflict of Interest: none declared.