For many RNAs there is a close connection between structure and function. Having a good model of the structure of an RNA is thus critical and often the first clue towards elucidating its function. Determining the complete three dimensional structure (“tertiary structure”) of an RNA is a tedious and time-consuming undertaking. Computational methods – either completely *de novo* or assisted by experimental data – are therefore routinely used to *predict* structure models. A strong focus lies on the prediction of the secondary structure, i.e. the pattern of intramolecular base-pairs (A·U, G·C and G·U) typically formed in RNAs.

2.1. Secondary structure

2.1.1. Thermodynamic folding of single sequences In RNAs, the secondary structural elements are responsible for most of the overall folding energy and can be seen as a coarse-grained approximation of the tertiary structure. This important biophysical property in combination with the fact that secondary structure can easily be formalized as a simple graph (), led to secondary structure being widely studied early on. One of the first attempts to approach the RNA folding problem (i.e. predicting the secondary structure from the primary sequence) was by Nussinov and Jacobson

^{12}. They proposed an algorithm to find the secondary structure with the maximum number of base pairs. It is one of the classical examples of dynamic programming algorithms in computational biology and all modern variants of folding algorithm essentially use the same principle (

Box 1).

Box 1. Dynamic programmingThe Dynamic Programming (DP) paradigm is used for many algorithms related to RNA folding. DP breaks down a problem in smaller sub-problems to find the overall solution efficiently. Nussinov’s algorithm is a classical example of a DP algorithm. Let’s assume we want to find the minimum free energy

*E *_{i,j} between the positions

*i* and

*j* of a sequence and already know the solution for a sequence from

*i* + 1 to

*j*, i.e. a sequence that is one base shorter. The new base

*i* can either be unpaired or forms a base-pair with some position

*k*:

The base-pair

*k* divides the problem into two smaller sub-problems, namely finding the solution for

*E*_{i}_{+1}_{,k−}_{1} and

*E*_{k}_{+1}_{,j}. We thus can find the solution using a recursive algorithm:

*β*_{i,k} is the energy contribution for the base-pair

*i, k* in this simplified energy model.

In practice, however, finding the structure with the maximum number of pairs does not give accurate results. Ideally, we want to find the structure of minimum folding energy. Since most of the folding energy in RNAs is contributed by stacking interactions between neighbouring base pairs, counting single base pairs is not sufficient. Therefore, current folding algorithms use a “nearest neighbour” or also called “loop-based” energy model. A structure is uniquely decomposed into substructural elements (stacked bases, hairpin-loops, bulges, interior-loops, and multi-way-junctions, ). The structural elements are assigned energies which add up to the total folding energy of the structure (). The energy values are established empirically and typically come from systematic melting experiments on small synthetic RNAs. An up-to-date set of energy parameters is maintained by Douglas Turner’s lab^{13,14}.

A dynamic programming algorithm to find the minimum free energy (MFE) for this more complex energy model was proposed by Zuker and Stiegler^{15} and forms the basis for modern prediction programs. The most common implementations used are UNAFold^{16}, RNAfold of the Vienna RNA package^{17} and RNAstructure^{18}. The accuracy of MFE predictions depend on the type of RNA. Although some RNAs can be predicted with high accuracy, in general one has to expect that roughly a third of the predicted base pairs are wrong and one third of true base pairs are missed. It is thus important to keep in mind that even the best currently available prediction methods only give a rough model of the structure. Tertiary interactions, protein context and other inherent limitations of the energy model are all sources of potential errors.

At room temperature, RNAs usually exist in an ensemble of different structures and the MFE structure is not necessarily the biologically relevant structure. There are different algorithms to predict suboptimal structures close to the MFE structure^{19,20}. McCaskill’s algorithm^{21} allows one to calculate the partition function over *all* possible structures and subsequently the probability of a particular base pair in the thermodynamic ensemble. Considering the pair-probability matrix of all possible base pairs gives a more comprehensive view of the structural properties of an RNA than just the MFE prediction. It is also possible to obtain individual structure predictions from the pair-probability matrix, either by sampling^{22} or by finding a structure that maximizes the expected accuracy considering a weighting factor between sensitivity and specificity^{23,24}.

2.1.2. RNA folding using probabilistic models An alternative to the thermodynamic approach of RNA folding is a probabilistic approach based on machine-learning principles. Instead of using energy parameters, folding parameters can be estimated from a training set of known structures and are used to predict structures of unknown sequences. There are several probabilistic frameworks to accomplish parameter estimation and prediction. Stochastic context-free grammars (SCFGs) are a generalization of Hidden Markov models that are widely used in bioinformatics (

Box 2). SCFGs allow one to consider nested dependencies, a prerequisite to model RNA structure. They have been used successfully for homology search problems and consensus structure prediction (see below). Although SCFGs can be used for single sequence structure prediction

^{25} they are not widely used for this problem. CONTRAFOLD

^{23}, an alternative machine learning approach based on conditional random fields, however, could establish itself as a serious alternative to thermodynamic methods. There are also hybrid approaches that try to enhance the thermodynamic parameters by training on known structures

^{26,27}.

Box 2. Stochastic context free grammarsContext free grammars (CFG) are concepts from formal language theory. In most simple terms, a CFG is a set production rules

*V* →

*w* where

*V* represent a so-called nonterminal symbol that produces a string of terminal or non-terminal symbols

*w*. An example of a simple RNA grammar would be

*S* →

*aS*â

aS

Sa

SS

. The grammar has one type of non-terminal symbol

*S* and one type of terminal symbols

*a* *A, C, G, T* representing the bases. The grammar consists of production rules for unpaired and paired bases (

*a*â represent two complementary bases). This simple rules allow to produce all possible RNA secondary structures. A stochastic context free grammar (SCFG) extends CFGs by assigning probabilities to all production rules. In the case of our RNA grammar, a full parametrized SCFG would thus describe the probability distribution over all structures and sequences.

2.1.3. Incorporating structure probing data into folding algorithms Structure probing experiments typically use enzymatic or chemical agents that specifically target paired or unpaired regions^{28}. Most implementations of thermodynamic folding like UNAfold or RNAfold allow for the incorporation of this type of information by restricting the folding to structures consistent with the experimental constraints. As an alternative, experimental information can also be incorporated as “pseudo-energies” into the folding algorithm enforcing regions to be preferentially paired or unpaired reflecting the experimental evidence^{29–31}. Recently, high-throughput sequencing techniques were used to scale up structure probing experiments massively^{32–34}. Dealing with inherently noisy data of this type turned out to be challenging and is still an active field of research.

2.1.4. Secondary structure prediction for homologous sequences Another way to improve secondary structure prediction is to consider homologous sequences from related species. If two or more sequences share a common structure but have diverged on the sequence level, typical base substitution patterns that maintain the common structure can be observed (). A consistent mutation changes one base (e.g. A·U ↔ G·U) while compensatory mutations change both bases in the base pair (e.g. A·U ↔ G·C or C·G ↔ G·C). Clearly, these patterns provide useful information to infer a secondary structure.

The simplest way to exploit this signal is to calculate a mutual information score to find columns that show highly correlated mutation patterns. This method led to surprisingly accurate structures for rRNAs as early as 30 years ago^{35}. Since we rarely have such large datasets of RNAs with extremely conserved structures it is natural to combine co-variance analysis with classical folding algorithms. RNAalifold^{36} extends Zuker’s folding algorithm to multiple sequence alignments by averaging the energy contribution over the sequences and adding co-variance information in the form of “pseudo-energies”. A probabilistic alternative is Pfold^{37}, which uses a simple stochastic context free grammar combined with an evolutionary model of sequence evolution to infer a consensus structure. PetFold^{38} extends Pfold by incorporating pair probabilities from thermodynamic folding and thus unifies evolutionary and thermodynamic information. A more recent program, TurboFold^{39} also uses thermodynamic folding and iteratively refines the energy parameters by incorporating pair probabilities from homologous sequences.

2.1.5. Structural alignment Even unaligned RNAs can provide more information about their common structure than a single sequence. Low sequence homology below of 60% sequence identity^{40} prohibits the sequence alignment-based approach of the previous section (), since correct alignment requires information about the structure. Since structure predictions for single sequences are unreliable, folding the sequences followed by structure-based alignment can also fail.

Therefore, the most successful strategies fold and align the RNAs *simultaneously*. The first such algorithm^{41}, by Sankoff, simultaneously optimizes the alignment’s edit distance and the free energies of both RNA structures applying a loop-based energy model. However, only recent advances made this strategy applicable to practical RNA analysis.

The first complete pairwise Sankoff-implementation Dynalign^{42} implements a loop-based energy model, but employs a simple banding technique for increasing efficiency. A further pairwise Sankoff-like tool is Foldalign^{43}. Computing a loop-based energy model during the alignment, these algorithms are accurate but computationally expensive; in practice, they compensate this by strong sequence-based heuristic restrictions.

Several less expensive Sankoff-like algorithms are based on simplifications introduced by PMcomp^{44}. PMcomp replaces the loop-based energy model by assigning “pseudo-energies” to single base pairs. This reduces the computational overhead significantly. By computing the pseudo-energies of base pairs from their probabilities in the structure ensembles of the single RNAs, accurate information from the loop-based energy model is fed back into the light-weight algorithm.

Approaches following this idea are LocARNA^{45}, foldalignM^{46}, RAF^{47}, and LocARNA-P^{48}; All these tools additionally employ sparsity in the structure ensemble of the single sequences.

The sparsified PMcomp-like approaches are sufficiently fast for multiple alignment and large scale studies, performing e.g. clustering^{45,46} and *de novo* prediction of structural RNA^{49}.

2.2. Prediction with pseudoknots

Pseudoknotted structures follow the same rules as other secondary structures, but allow non-tree like configurations, e.g. due to an additional level of nested base pairs or pairing between different hairpin loops (kissing hairpin) (). Pseudoknots are prevalent in many ncRNAs; still, most algorithms ignore them for technical reasons: pseudoknot-folding is computationally expensive and accurate empirical energy models are missing.

2.2.1. Algorithmic challenges The run-time of pseudoknot-free structure prediction grows only with the cube of the sequence length. Unfortunately, when general pseudoknots are allowed, the computation time grows much faster, namely exponentially with the sequence length

^{50}. Consequently, finding exact solutions is intractable for all but very short RNAs. Note that the “principle of optimality”, which allows dynamic programming (

Box 1) in the pseudoknot-free case, is not applicable in the case of general pseudoknots. By this principle, every optimal structure can be composed from optimal structures of its subsequences.

In practice, often heuristic methods are applicable. Among the numerous approaches are ILM^{53}, HotKnots^{54}, KnotSeeker^{55}, and IPknot^{56}. ILM^{53} applies the classic principle of “hierarchic folding”; it constructs a pseudoknotted structure by iteratively predicting a most likely stem, using pseudoknot-free prediction, which is then added to the structure. HotKnots^{54} refines the construction of the pseudoknotted structure from likely more general secondary structure elements. TurboKnot^{57} even predicts conserved pseudoknots from a set of homologous RNAs. Applying a topological classification of RNA structures^{58}, TT2NE^{59} guarantees to find the best RNA structure regardless of pseudoknot complexity; however, this limits the length of treatable sequences.

Other algorithms restrict the types of pseudoknots, such that dynamic programming can be applied^{50–52,60–63}. These algorithms differ in their computational complexity and the complexity of considered structures. shows pseudoknots of different complexity.

Rivas and Eddy^{51} proposed the most general such algorithm. It predicts three-knots (), but cannot predict more complex pseudoknots such as the one shown in . Its large time and space requirements prohibit its application to large scale data analysis. The most efficient such algorithm^{52} has only the same space requirements as pseudoknot-free prediction; its run-time grows with the fourth power of the sequence length, adding only a linear factor over pseudoknot-free folding. However, it predicts only *canonical pseudoknots* (), which are formed by two canonical stems: such stems are (i) composed of canonical base pairs (ii) “*perfect*”, i.e. they do not contain interior loops or bulges, and (iii) *maximally extended*, i.e. they cannot be extended by canonical base pairs. Further such algorithms are tailored for specific interesting pseudoknot-classes ( and ^{63}). Möhl *et al*.^{64} recently managed to speed up such algorithms non-heuristically.

2.2.2. Energy models for pseudoknots A further challenge of pseudoknot prediction is to find an accurate energy model. The established loop-based energy models for RNA are tailored for pseudoknot-free structures; to date, there are no empirical energy parameters for pseudoknotted structure elements.

Consequently, some algorithms consider only the simplistic case of base-pair maximization^{65,53}. Although some authors argue that important entropy contributions in pseudoknots cannot be covered by a loop-based energy model^{66}, most approaches extend the loop-based energy model for pseudoknot-loops^{51,67}.

2.3. Tertiary structure

While secondary structure is strongly stabilizing the three-dimensional structure, the tertiary structure depends on stabilizing non-canonical base pairs and van-der-Waals interactions. Furthermore, pseudoknots impact the tertiary structure. Therefore, deriving the tertiary structure in a hierarchic way from predicted secondary structure is not straightforward.

There are two main ways to model tertiary RNA structure. One, template-based modeling, employs homology to other RNAs with known structures. The other, *de novo* prediction, computes structures from physical and knowledge based rules. For example, the MC-fold/MC-sym pipeline^{68} () assembles fragments of experimentally determined three-dimensional structures. Based on such structures, the approach builds a library of frequent small secondary structure loop-motifs, called Nucleic Cyclic Motifs (NCMs), together with their three-dimensional configurations. Given a RNA sequence, MC-Fold constructs probable secondary structures by merging NCMs; in this process it assigns likely NCMs to subsequences. MC-sym assigns concrete 3D-structures to the NCMs to generate a consistent three-dimensional structure. The approach has been tested on 13 RNAs of an average size of 30 nucleotides. Running several hours per prediction, the known 3D-structures have been reproduced within 2.3Å at average^{68}. Similar to MC-Fold, the recent RNAwolf^{69} predicts extended secondary structures considering non-canonical base pairs; the same work reports a more efficient, dynamic programming-based reimplementation of MC-Fold, which improves parameter estimation. A more detailed review and a systematic performance comparison of RNA tertiary structure prediction programs is provided by Laing *et al*.^{70}.

2.4. RNA/RNA interactions

Many ncRNAs interact with other RNAs by specific base-pairing; most prominently, microRNAs target the untranslated regions of mRNAs. Predicting RNA/RNA interactions can thus elucidate RNA interaction partners and potential functions.

Most generally, one aims to predict the secondary structure of the interaction complex of two RNAs consisting of *intra*molecular and *inter*molecular base-pairs (). Alkan *et al*.^{71} formalized the problem and showed that – similar to pseudoknot prediction – it cannot be solved efficiently. Therefore, several simplifications and heuristics have been proposed.

Most approaches restrict the possible structures of the interaction complex to enable efficient algorithms using dynamic programming. shows interaction complexes from several restriction classes. The simplest approaches ignore intramolecular base-pairs and predict only the best set of interacting base-pairs (); examples are RNAhybrid^{72} and RNAduplex^{73}.

A more general approach optimizes intra- and intermolecular base-pairs simultaneously in a restricted structure space. “Co-folding” of RNAs, for example implemented by RNAcofold^{73}, concatenates the two RNA sequences and predicts a pseudoknot-free structure for the concatenation. Co-folding leads to a very efficient algorithm but strongly restricts the space of possible structures, such that only external bases can interact ().

The dynamic-programming algorithms^{71,74} that predict more general structures (), forbidding only pseudoknots, crossing interaction, and zig-zags (), are computationally as expensive as the most complex efficient pseudoknot prediction algorithm^{51}; they are therefore rarely used in practice, albeit their efficiency has been improved recently^{75}.

Several fast methods^{71,76,77} assume that interactions form in two steps: First, the RNA unfolds partially, which requires certain energy to open the intramolecular base-pairs. Second, the unfolded, now accessible, RNA hybridizes with its partner forming energetically favorable intermolecular base-pairs. RNAup^{76} computes the energies to unfold each subsequence in the single RNAs and combines the unfolding energies with the hybridization energies to approximate the energy of the interaction complex. IntaRNA^{77} optimizes this approach and extends it to screen large data sets for potential interaction targets.

Finally, several approaches predict conserved interactions between multiple sequence alignments^{78,79}.

2.5. Kinetic folding

Common structure prediction methods assume that the functional RNA structure can be identified solely based on the thermodynamic equilibrium without considering the dynamics of the folding process. Although the true impact of kinetics on functional RNA structures is still unknown, for example RNA-switches^{80} show the importance of understanding the RNA folding process.

Several groups have studied the folding process of RNAs (reviewed in^{66,83}). The RNA folding process is commonly modeled using energy landscapes^{84}. Such landscapes assign energies to single structures, or states, and define neighborship between states. RNAlocopt^{85} enables studying the Boltzmann ensemble of local optima in an RNA energy landscape. The folding process iteratively moves from one state to a neighbor; the move probability depends on the energy difference. Studying folding by simulation^{83} is expensive since it requires averaging over many trajectories. Because the exact model of folding as a Markov process can be solved only for small systems, many methods coarse-grain the energy landscape to enable the analysis of the process. For example “barrier-trees” represent the energy landscape as a tree of local minima connected by their saddle points^{82}. BarMap^{81} generalizes coarse-graining to non-stationary scenarios like temperature changes or co-transcriptional folding.^{86} predicts RNA folding pathways based on motion planning techniques from robotics. Kinefold^{87} simulates single folding paths over seconds to minutes for sequences up to 400 bases.