PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of narLink to Publisher's site
 
Nucleic Acids Res. 2012 May; 40(10): 4261–4272.
Published online 2012 January 28. doi:  10.1093/nar/gks009
PMCID: PMC3378861

RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction

Abstract

Thermodynamic folding algorithms and structure probing experiments are commonly used to determine the secondary structure of RNAs. Here we propose a formal framework to reconcile information from both prediction algorithms and probing experiments. The thermodynamic energy parameters are adjusted using ‘pseudo-energies’ to minimize the discrepancy between prediction and experiment. Our framework differs from related approaches that used pseudo-energies in several key aspects. (i) The energy model is only changed when necessary and no adjustments are made if prediction and experiment are consistent. (ii) Pseudo-energies remain biophysically interpretable and hold positional information where experiment and model disagree. (iii) The whole thermodynamic ensemble of structures is considered thus allowing to reconstruct mixtures of suboptimal structures from seemingly contradicting data. (iv) The noise of the energy model and the experimental data is explicitly modeled leading to an intuitive weighting factor through which the problem can be seen as folding with ‘soft’ constraints of different strength. We present an efficient algorithm to iteratively calculate pseudo-energies within this framework and demonstrate how this approach can be used in combination with SHAPE chemical probing data to improve secondary structure prediction. We further demonstrate that the pseudo-energies correlate with biophysical effects that are known to affect RNA folding such as chemical nucleotide modifications and protein binding.

INTRODUCTION

RNAs fulfill a large number of diverse biological functions in the cell (1). This wide functional spectrum of RNAs is made possible by the structural diversity of these highly flexible molecules. Studying the structure of a novel RNA thus is often the first step toward elucidating a possible biological function. Resolving the complete tertiary structure is a complex undertaking, however, so it is usually the secondary structure that is analyzed first. In a typical probing experiment, the RNA is enzymatically digested or chemically modified in a manner that is specific for structural context (2,3). These experiments typically reveal which nucleotides are contained within a double-stranded helix and which nucleotides form unpaired loops. In addition, solvent accessibility or local flexibility can be assessed, see Ref. (4) for a recent review. Structure probing experiments have been routinely used for many years. More recently, high-throughput methods have been introduced (58) and next-generation sequencing techniques have made it possible to perform probing experiments even on a genome-wide scale (9,10).

All these experiments, however, only report partial information on the structure and even a perfect experiment does not reveal the actual base pairing patterns (11). Therefore, the results of probing experiments need to be combined with computational predictions. Most commonly, programs such as mfold (12), RNAstructure (13) or RNAfold (14) are employed that predict secondary structures by minimizing free energy. They are based on an empirical energy model (15), which is based on a very large set of thermodynamic measurements on small RNA oligonucleotides. In the simplest case, the predicted structure is manually adjusted to fit the measured constraints. To automatize this process, prediction programs allow the user to restrict the search space to only consider structures compatible with certain constraints observed in the probing data.

An alternative method is to include information from experiments as ‘pseudo-energies’ in the energy model. This approach was introduced by Matthews et al. (16,17) and is implemented in the program RNAstructure (13). On chemical probing data generated by selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) (5), it showed nearly perfect results on Escherichia coli 16s rRNA (17) and it was successfully used to predict structure models for the complete HIV genome (18).

In this article, we expand on the idea of incorporating experimental data as pseudo-energies into energy-based folding algorithms. Instead of adding ad hoc modifications to the minimum free energy calculation, we propose a formal method to reconcile experimental information with the theoretical prediction in the partition function over all possible structures. The partition function describes the entire ensemble of secondary structures in thermodynamic equilibrium and allows to calculate an intuitive matrix of base pairing probabilities (19).

Our approach is based on the assumption that both experimental measurements and the thermodynamic energy parameters are imperfect, noisy approximations of the physical reality. In this setting, it becomes natural to ask for a perturbation vector that minimizes a weighted sum of perturbation energies and discrepancies between measured and predicted base pairing probabilities. In the simplest case, this can be written as a least square approximation problem of the form

equation image
(1)

Here, An external file that holds a picture, illustration, etc.
Object name is gks009i1.jpg is the perturbation vector and ϵμ the perturbation energy added for some structural element μ. An external file that holds a picture, illustration, etc.
Object name is gks009i2.jpg and qi are the predicted and measured base pairing probabilities for position i, respectively. The estimated variances An external file that holds a picture, illustration, etc.
Object name is gks009i3.jpg and An external file that holds a picture, illustration, etc.
Object name is gks009i4.jpg of energy parameters and measurements, respectively, serve as weighting factors. This optimization problem can be viewed as energy directed folding with soft constraints replacing the hard combinatorial constraints used before.

We show here (i) that the pseudo-energies An external file that holds a picture, illustration, etc.
Object name is gks009i5.jpg can be efficiently calculated by an iterative algorithm, (ii) that the approach combined with SHAPE data leads to improved secondary structure predictions, (iii) that the algorithm also can successfully handle cases of RNAs with several alternative structures and (iv) that the pseudo-energies have an interpretable meaning and indicate positions where experimental data and the thermodynamic energy model disagree.

MATERIALS AND METHODS

Minimization of the objective function using gradient descent

The objective function in Equation (1) and the motivation behind it is explained in more detail under ‘Rationale’ in the ‘Results’ section. Here, we show how to efficiently find the minimum of this function.

The minimum of the objective function, Equation (1), satisfies [partial differential]F/[partial differential]ϵμ = 0 for all parameters, i.e.

equation image
(2)

Numerically, this can be solved by iteratively minimizing F. We use a gradient descent iteration of the form

equation image
(3)

with a step size a < 0. We chose this approach because it only depends on the first-order derivatives of pi with respect to ϵμ. In the following paragraphs, we show that the required partial derivatives An external file that holds a picture, illustration, etc.
Object name is gks009i6.jpg can be obtained analytically from constrained partition functions.

Analytic calculation of the gradient

Since ϵμ denotes the energy contribution that is added to all secondary structures that contain a particular ‘structural feature’ μ, we can subdivide the structure ensemble into those structures that ‘have μ’, and those that do not. This is possible for any parameter of the standard energy model and for any additional position-dependent term. Let Z[i](ϵμ) be the partition function over all states with position i unpaired in the perturbed energy model, whereas Z[i](0) is the corresponding partition function in the reference state. Similarly, Z[μ](.) and Z[i, μ](.) denote the partition functions over all structures that ‘have μ’, and of those that both ‘have μ’ and leave i unpaired, respectively, for each of the two energy models. The crucial observation is that the following identities hold for these constrained partition functions:

equation image
(4)

By construction, furthermore, we have

equation image
(5)

Since pi(.) = Z[i](.)/Z(.), we can express the partial derivatives in terms of restricted partition functions. We only need to compute the derivates at the reference energy model (which we take to be the energy model in each step of the gradient iteration).

equation image
(6)

The probabilities of the structural patterns, p[μ], can be obtained by McCaskill's algorithm (19) provided μ is a base pair (k,l), an unpaired position j, or another feature that appears implicitly in the dynamic programming recursions.

Implementation for position-specific perturbations

Here we consider the simplest case of perturbations that add positive or negative energy contributions to single positions. In that case, we can replace the generic dimension μ with an additional index 1  j  n, and the gradient takes the form

equation image
(7)

We have extended the implementation of McCaskill's algorithm in the Vienna RNA package (version 2.0 beta) to calculate the partition function of a sequence with additional position-specific energy contributions. More precisely, during the energy evaluation step that calculates energies for different structural elements such as stacked pairs, hairpins, interior loops and multiloops, we add ϵj if position j is unpaired for the particular structural element. By adding negative (favorable) perturbation energies, we enforce a position to be unpaired, while adding positive perturbation energies will lead to a position be more likely to be paired. pi can then directly be calculated from the partition function under the perturbed energy model.

In principle, also the term p[j|i] can be easily calculated directly from the partition function. The conditional probability that j is unpaired given that i is unpaired as well can be obtained by constraining the dynamic programming recursion to structures in which i is unpaired. However, the partition function algorithm scales An external file that holds a picture, illustration, etc.
Object name is gks009i53.jpg(n3) in CPU time with length n. Evaluating all n conditional probabilities renders the whole algorithm requiring An external file that holds a picture, illustration, etc.
Object name is gks009i54.jpg(n4). This is too expensive in terms of computational resource for practical applications, however.

To overcome this problem, we estimate the term p[j|i] by sampling structures from the thermodynamic ensemble. We use stochastic backtracking (20) to randomly generate structures proportional to their Boltzmann weight and empirically determine p[j|i] from the random structures.

To get actual structure models from the base pair probability matrix, we used the maximum-expected accuracy approach (21,22) with a γ-parameter of 1.0.

Missing data, i.e. positions i for which no qi is available, are handled transparently by setting σ = ∞ resulting in position i being effectively ignored in the evaluation of the objective function [Equation (1)].

Analysis of SHAPE data

We used SHAPE reactivities for 23S and 16S rRNAs as reported by Deigan et al. (17). The 23S and 16S rRNAs were split in 6 and 4 domains, respectively, of a maximum length of 700 nt as described before (21). We did not use domain 4 of the 16S rRNA because it was poorly covered by the SHAPE data and mainly consisted of missing data. As a reference structure, we used the same phylogenetically derived structure as in Ref. (17). Accuracy was measured as Sensitivity = (number of correctly predicted base pairs)/(total number of known base pairs) and the positive predictive value PPV = (number of correctly predicted base pairs)/(total number of predicted base pairs). As a combined measure of sensitivity and positive predictive value, we also used the Mathews correlation coefficient as described previously (23). Deigan et al. found 16.5 and 13.6% of the positions in the 16S and 23S rRNA, respectively, where the in vitro folded RNA as probed by the SHAPE data is different from the phylogenetic structure (corresponding the in vivo proteinized state). Deigan et al. removed these sites in their benchmark, while we kept it for the benchmarks reported here. The overall accuracies achieved in our benchmarks are therefore lower as reported by Deigan et al.

To use the SHAPE data with our algorithm, we discretized the reactivities by classifying them in paired and unpaired positions using a cutoff of 0.25 [see also Ref. (11)]. This cutoff corresponds to an error rate of about 25% of positions being incorrectly classified. We also tried a two cutoff approach and classified all positions with SHAPE reactivities <0.1 and >1.5 as paired and unpaired, respectively. This lowers the error rate to about 10% at the expense of a lower coverage of around 50%, i.e. more missing data. We did not see any significant advantage using this approach for any of the methods (data not shown). Furthermore, we also tried more sophisticated machine learning methods to classify bases as ‘paired’ and ‘unpaired’ according to their SHAPE signal. Essentially, we face a machine learning problem to parse the continuous SHAPE signal as shown in Supplementary Figure S1E into discrete states. In principle, this enables us to consider also the context of a base during classification. However, also here we did not find a significant improvement over the simple thresholding approach.

RNAstructure (version 5.3) was run with default values and with parameters of m = 2.6 and b = −0.8 and these were found to be optimal on this specific data set (17). The ‘Sample + Select’ strategy described in Ref. (11) was re-implemented using RNAfold, 105 structures were sampled and the structure with the lowest Manhattan distance to the discretized SHAPE vector was used. The results for hard constraints were calculated with RNAfold and the option -C.

Availability

All source code accompanying this article can be downloaded here: https://github.com/wash/probing.

RESULTS

Rationale

Similar to previous approaches (16,17), our algorithm modifies the folding energy parameters (15), which are used in RNAfold, mfold and RNAstructure. In the following, we refer to these standard parameters as the ‘reference energy model’ and the positive or negative pseudo-energies that change this model as ‘perturbations’. Let An external file that holds a picture, illustration, etc.
Object name is gks009i7.jpg be a vector of perturbations of the reference energy model. In the most generic formulation, we consider a collection of structural elements whose contribution to the energy model can be perturbed. We use the index μ to refer to one of these degrees of freedom, which correspond to the coordinates of the vector of perturbation energies An external file that holds a picture, illustration, etc.
Object name is gks009i8.jpg. Note that these degrees of freedom need not be structural elements that correspond to parameter of the reference energy model.

Our goal is to find a vector An external file that holds a picture, illustration, etc.
Object name is gks009i9.jpg that changes the standard energy model in the light of the experimental data. Deigan et al. (17) chose the perturbations proportional to the experimental signal. More precisely, for each position i they mapped the SHAPE reactivity R(i)—the experimental signal for being unpaired (24)—to perturbation energies using the following relationship a + mln[1 + R(i)].

In practice, this strategy gave good results. Theoretically, however, this approach is poorly justified; in particular, there does not seem to be a meaningful biophysical interpretation of the energy model. Ideally, if experiment and the energy model agree perfectly, An external file that holds a picture, illustration, etc.
Object name is gks009i10.jpg should vanish. By setting An external file that holds a picture, illustration, etc.
Object name is gks009i11.jpg proportional to the experimental signal, however, the exact opposite is the case. Positions that show the highest signal in the experiment and already are predicted with high probability are assigned the highest perturbation energies.

Here, we regard both the experimental data and the structure prediction based on the energy model as a noisy approximation to the physical ground truth. Therefore, our goal is to find a perturbation vector that minimizes the discrepancy between the experimental measurement and computational prediction. In particular, we seek a perturbation vector An external file that holds a picture, illustration, etc.
Object name is gks009i7.jpg that modifies the energy model only when necessary.

This is achieved by minimizing the total error of both energy model and measurements. We assume that the experimental data is given in form of a probabilistic signal as a vector qi of the probabilities that position i is unpaired and an associated variance An external file that holds a picture, illustration, etc.
Object name is gks009i12.jpg. Likewise, we assume a variance An external file that holds a picture, illustration, etc.
Object name is gks009i13.jpg for the uncertainty of the parameters of the standard energy model. Assuming, furthermore, that individual energy parameters as well as the measurements for each sequence position are independent, we obtain the error function

equation image

Here, An external file that holds a picture, illustration, etc.
Object name is gks009i14.jpg is the predicted probability that nucleotide i is unpaired in the energy model perturbed by An external file that holds a picture, illustration, etc.
Object name is gks009i15.jpg. The choice of the quadratic error function An external file that holds a picture, illustration, etc.
Object name is gks009i16.jpg is the most natural one from a mathematical point of view since all its terms have a natural interpretion as variances. The minimization problem thus evenly distributes the residual deviations between energy parameter set and measured data depending on their intrinsic variances An external file that holds a picture, illustration, etc.
Object name is gks009i17.jpg and An external file that holds a picture, illustration, etc.
Object name is gks009i18.jpg.

In principle, both the variances An external file that holds a picture, illustration, etc.
Object name is gks009i19.jpg and An external file that holds a picture, illustration, etc.
Object name is gks009i20.jpg can be estimated from probing experiments and the experiments underlying the standard energy model, respectively. However, in this article we will not use explicit estimates but rather treat them as parameters that control whether more weight is given on the experimental data or prediction of the energy model. If τ [dbl greater-than sign] σ the algorithm will find a solution closest to the experimental information, while in the other case the experimental information will be ignored and the solution will be essentially the same as the prediction of the unperturbed energy model. Note that the solution An external file that holds a picture, illustration, etc.
Object name is gks009i21.jpg of the optimization problem depends only on the ratio τ/σ, i.e. on the relative accuracy of the energy model and measured probing data.

Iterative adaption of the energy parameters

So far we did not specify which parameters μ of the energy model are actually considered to be subject of perturbations. Since typical experiments only report data on whether a base is likely to be paired or unpaired, it is not useful to consider base pairs or any higher order structural elements. We therefore concentrate on the simplest case and consider only position-specific perturbations ϵj (‘Materials and Methods’ section).

We have implemented an efficient strategy to find the minimum of the objective function in this case (‘Materials and Methods’ section). It is based on a gradient descent algorithm. The gradient for the objective function can be calculated analytically (see ‘Materials and Methods’ section).

We first tested the algorithm on an artificial sequence that can fold in two alternative structures. The one-stem structure corresponding to the ground state of the unperturbed energy model, An external file that holds a picture, illustration, etc.
Object name is gks009i22.jpg, is energetically highly favorable. The less stable alternative three-stem structure is used here as the experimentally supported structure that our algorithm is supposed to recover. We considered the paired/unpaired probability profile of the target structure as perfect ‘experimental’ data and set qi to 0.0 or 1.0 for paired and unpaired positions i, respectively. Accordingly, we chose the associated variance σ2 of An external file that holds a picture, illustration, etc.
Object name is gks009i23.jpg low and set σ2 = 0.01 and τ2 = 1.0. This example is a hard test for our approach: since the two structures have very distinct pairing profiles, major refolding is required to correct the energy-based prediction.

We start with An external file that holds a picture, illustration, etc.
Object name is gks009i24.jpg. Using the exact solution for the gradient [Equation (7), ‘Materials and Methods’ section), we observe that the algorithm finds a minimum after about 150 iterations (Figure 1, upper left diagram). This minimum is confirmed by the fact that norm of the gradient (Figure 1, below) converges to zero and is <0.001 after 246 iterations. The corresponding base pairing probability matrices gradually change from the original one-stem structure to the alternative three-stem structure. In the minimum, the structure is completely refolded and conforms to the desired target structure (Figure 1, right).

Figure 1.
Iterative adaption of energy parameters. A sequence is re-folded from a single-stem structure to a three-stem structure. The diagram on the left shows the minimization path of the objective function [Equation (1)] and the associated norm of the gradient. ...

We repeated the minimization calculation starting from five different random vectors An external file that holds a picture, illustration, etc.
Object name is gks009i25.jpg. All five start points lead to the same minimum confirming that the procedure is robust and finds consistently the same solution. To further confirm the validity of the analytically derived gradient, we repeated the iteration with a numerically calculated gradient An external file that holds a picture, illustration, etc.
Object name is gks009i26.jpg. Setting d = 10−5, we observe that both solutions lead to exactly the same minimization path (Figure 1).

Efficient solutions for long sequences

The exact analytical solution of the gradient as well as the numerical approximation scales as An external file that holds a picture, illustration, etc.
Object name is gks009i55.jpg(n4) with the sequence length n (‘Materials and Methods’ section). The form of the analytical solution given in Equation (7) (‘Materials and Methods’ section), however, suggests that a major speedup can be achieved if the term p[i|j] can be computed more efficiently. This can be done by random sampling from the thermodynamic equilibrium since an accurate estimate is needed only when position j is unpaired with a noticeable probability, so that fairly small samples are sufficient. We repeated the minimization using this approximation. Sampling the gradient from 10 000 random structures leads to the same minimum as the exact solution. In this particular example, we observe a slight deviation in the minimization paths after about eight iterations (Figure 1). In most other examples, however, we observed the paths to be identical. Only when the minimization reaches a point close to convergence, the approximated gradient fails to further improve the objective function. In this example, the calculation with the sampled gradient stopped after 113 iterations with the norm of the gradient in the order of 1.

To verify that our algorithm is capable of finding the solution also for longer sequences in reasonable time we ran the algorithm on RNAs of different lengths. We used the same parameters as before and used the known secondary structure as ‘target’ structure. To test if the sampled gradient gives the same solution as the exact gradient, we ran the minimization with the exact gradient until the norm of the gradient was <0.1 and with the sampled gradient until the objective function could not be improved any more. Using the sampling approach, the solution was found within seconds for small RNAs of about 100 nt like tRNAs or the 5S rRNA and within minutes for longer RNAs of about 300 nt (Table 1). Following previous work (17,21), the longest sequence tested was 686 nt long. Also for this length our algorithm using the sampled gradient could find a solution within an hour. In contrast, the exact solution took longer than 6 days. The objective function was minimized by more than 95% in all cases. Despite the extreme differences in running time, the sampling approach led to essentially the same rate of minimization as the exact approach.

Table 1.
Optimization efficiency for RNAs of various length

Improved structure models using SHAPE probing data

We next demonstrate that our algorithm in the course of minimizing the objective function actually optimizes the secondary structure prediction. Following Deigan et al. (17), we used E. coli 23S and 16S rRNA to benchmark the structure models obtained by our algorithm.

First, we considered the limiting case of perfect data, i.e. we used the paired/unpaired profile of the reference structures (‘Materials and Methods’ section) as input. In that case, the new iterative ‘soft constraint’ algorithm should give the same results as the ‘hard’ combinatorial constraints that can be applied to classical minimum free energy folding. We ran our algorithm with different combinations of τ/σ and compared the results to RNAfold without constraints and with hard constraints (Figure 2A). Combinations σ [dbl greater-than sign] τ essentially ignore the external data resulting in similar predictions as standard RNAfold. With increasing weight on the external data (i.e. σ [double less-than sign] τ), the accuracy increases and finally converges to the same level of RNAfold with hard constraints. This level represents the theoretical accuracy that can be achieved by the combination of thermodynamic folding with probing data.

Figure 2.
Structure prediction benchmark. (A) Prediction accuracy on 23S and 16S rRNAs as measured by the Matthews correlation coefficient (higher is better). Our iterative algorithm was run with different combinations of τ/σ on ‘perfect ...

We next used SHAPE data (17) to test our algorithm on real probing data. The SHAPE signal measures local nucleotide flexibility. The signal is generally higher in unpaired regions than in paired regions (Supplementary Figure S1A–C). It is important to note, however, that there is no simple relationship between nucleotide flexibility and base pair probabilities and there are systematic differences between these two properties beyond statistical noise (Supplementary Figure S1D and E). For example, SHAPE signals have a typical peak structure with nucleotides in the middle of a loop being usually the most reactive. However, the probability of these nucleotides to be unpaired in the thermodynamic ensemble has generally not the same peak shape (Supplementary Figure S1E). We have tried various ways to map the SHAPE signal to the probability vector qi. However, we found that converting the SHAPE signal into a discrete vector with qi = {1.0, 0.0} using a simple thresholding approach (‘Materials and Methods’ section) gave the best results.

Again, we ran our algorithm with varying values of τ/σ (Figure 2A). We observed an improvement in prediction accuracy over the standard RNAfold prediction with increasing weight on the SHAPE data. However, at around τ/σ = 0.5 the improvement peaks for both the 23S and 16S rRNA, and due to the inherent noise in the SHAPE experiment, the accuracy drops again when more weight is given on the experimental data.

We also run RNAfold with hard constraints on the same data. Here, the accuracy does not improve and is generally worse than RNAfold without probing data. In contrast to the soft constraint algorithm, a small number of inaccurate constraints introduced by the noise in the data can almost completely destroy the prediction in this case.

We further compared to two other methods that were used in combination with SHAPE data before. We ran minimum free energy prediction augmented with pseudo energies as described in Deigan et al. (RNAstructure + SHAPE) (17). We used the same parameters m and b that were found to be optimal on exactly the same data by Deigan et al. In addition, we also implemented the ‘Sample and Select’ approach described in Quarrier et al. (11). This strategy samples a large number of random structure from the ensemble and chooses a structure with the minimum distance to the probing data under a simple distance metric (‘Materials and Methods’ section). Figure 2B summarizes the results for all methods averaged over all domains of both rRNAs. We found that all methods except RNAfold with hard constraints lead to improved predictions over RNAfold (and the equivalent RNAstructure implementation) of about 15–20%. Our soft constrained algorithm achieves 0.70 ± 0.08 sensitivity and 0.71±0.07 positive predictive value, while ‘RNAstructure + SHAPE’ and the ‘Sample + Select’ approach achieve 0.70 ± 0.07/0.67±0.09 and 0.67 ± 0.07/0.65 ± 0.08, respectively.

Recovering the ensemble of a bistable structure

So far we only considered the case that the external pairing signal originates from a single target structure. However, RNA molecules typically are not present as a single structure but form an ensemble in which very different structures can be present simultaneously. This is of biological significance in particular for riboswitches (25) and ribozymes (2628). The signal measured in a probing experiment, therefore, will in general be a superposition of responses from structural alternatives. We tested, therefore, if our algorithm can recover the base pairing matrix of more complex ensembles of alternative structures. We used a sequence that served as a starting point to design an effective thermoswitch (29). The sequence can fold into two alternative structures (a single hairpin or a two-stem structure). Folding with RNAfold at 37°C predicts that both alternatives are roughly equally probable in the ensemble (see base-pairing matrix labeled as ‘target ensemble’ in Figure 3). At low temperatures the single hairpin dominates. We asked if we can induce the mixed ensemble at low temperature by modifying the energy parameters using a perturbation vector. This represents a common situation where the experimental conditions such as temperature or salt concentration are different in the experiment and in the thermodynamic model.

Figure 3.
Recovering the correct structure ensemble of a bistable structure. A sequence that folds into a one- and two-stem structure with equal probabilities at 37°C, folds predominantly into the one-stem structure at 10°C. Using the qi vector ...

First we tried the method from Deigan et al. (17) and set ϵi = b + mln[1 + qi]. For qi we used the probability of being unpaired in the target ensemble at 37°C and we set m = 2.75 and b = −0.75, a combination that generally worked well in our implementation and that is also close to the published parameters. Using this approach, the resulting base pair matrix only shows one hairpin structure and not the expected ensemble of the two alternative structures (Figure 3A). We also tried to systematically search for other parameter pairs m and b and also other combinations failed to recover the correct ensemble. We next used our iterative minimization algorithm and set the probability of being unpaired at 37°C as our input vector qi. Running our algorithm at 10°C with σ2 = 0.01 and τ2 = 1.0, we could calculate a perturbation vector that gives exactly the expected results (Figure 3A).

This simple example highlights a major advantage of the present approach over both hard constraints and simplistic bonus energies: since we consider the entire Boltzmann ensemble and model the observable experimental signal as a superposition of contributions from the individual members of the ensemble, we can also accommodate seemingly conflicting data that arise from different subsets of structures in the ensemble. The effect of our pseudo-energies is merely to distort the relative frequencies of structures within the ensemble.

Correlation of perturbation energies with nucleotide modifications

Another advantage of our algorithm is that it calculates position-specific perturbation energies that are non-zero only when they are required to reconcile the experimentally observed data with the energy model. The perturbations thus identify regions along the sequence where the energy model fails to accurately represent the observed data.

Chemically modified nucleotides are an important source of inaccuracies because they are not explicitly considered in the energy model. Such post-transcriptional modifications are common in several classes of non-coding RNAs. They are particularly well-studied for tRNAs (30,31). Generally, tRNAs fold into the functional cloverleaf structure spontaneously in vitro without being modified (31). However, there is one well-known exception to this rule. The human mitochondrial tRNA-Lys was found to be misfolded in vitro while forming the canonical cloverleaf in vivo. One particular base methylation is sufficient to induce the correct folding also in vitro (32).

Theoretically, our approach should be able to identify nucleotides with modifications that influence their pairing behavior. In such a situation, we expect a large perturbation energy localized at the modified nucleotide and possibly its pairing partner. We thus analyzed the behavior of the mito-tRNA-Lys in silico. Folding with RNAfold clearly does not result in the typical cloverleaf structure but rather yields an extended stem structure (Figure 4). This is consistent with the in vitro results, which also did not show the canonical cloverleaf structure. We ran our algorithm on the sequence and imposed the cloverleaf structure as external constraint. The algorithm finds a minimum after 18 iterations and leads to a refold of the structure. The resulting perturbation vector shows two distinct peaks strongly suggesting that the base pair stacks between positions 8,9 and 61,62 is the most critical for the molecule to fold into the correct structure. The high peak that suppresses this base pair stack corresponds to the methylation that also was shown in vitro to be responsible for the refolding. It is important to note that a simple comparison of the misfolded prediction of the standard energy model to the reference structure will not give the same information (see the difference plot of pi of the initial prediction and qi of the reference structure in Figure 4). Since the molecule undergoes re-folding the initial pi of the misfolded structure is not informative and only an iterative approach will identify the positions critical for the structure change.

Figure 4.
Perturbation energies correlate with nucleotide modifications in tRNAs. Human mitochondrial tRNA-Lys does not fold into the canonical cloverleaf using the standard energy model (top) but can be easily re-folded with perturbations calculated by our algorithm ...

We further asked if there is a general correlation of nucleotide modifications and perturbations calculated from our algorithm. To this end, we analyzed 160 tRNAs contained in the MODOMICS database (33) in exactly the same way as the human mito-tRNA-Lys example above. The MODOMICS database contains experimentally determined nucleotide modifications of various RNAs. Again, we used the canonical cloverleaf structure as the input of our algorithm. We found that the absolute value of the perturbations for modified bases (0.25 kcal/mol) is on average higher for modified bases than those for unmodified bases (0.17 kcal/mol). The difference (Figure 5) is significant (Mann–Whitney test P < 2 × 10−16) and implies that discrepancies between the standard energy model and the canonical tRNA structure can be partly attributed to nucleotide modifications. However, we only found few candidates where a nucleotide modification seems to directly cause a complete refold. This confirms that the human mitochondrial tRNA-Lys described in the literature is an outstanding example and most other tRNAs fold into the cloverleaf shape spontaneously without modification (31).

Figure 5.
Distribution of perturbation energies for modified and non-modified nucleotides in tRNAs. The 160 tRNAs with known modifications were forced to fold into the canonical tRNA structure and the values of the perturbation energies were analyzed. Non-zero ...

Correlation of pseudo-energies and protein binding

RNA binding proteins are another reason that can cause differences between experimentally observed and thermodynamically predicted structures (34). The 5′-end of the sodB mRNA in Escherichia coli was found to change the structure upon binding of Hfq (35). Hfq acts as a chaperone and opens a region that forms a intermolecular interaction with the small RNA RyhB. We ran our algorithm applying the structure model proposed for the sodB mRNA by Geissmann et al. (35). We observed high energy perturbations in the second half of the analyzed region (Figure 6), which corresponds exactly to the region that shows the protein-induced structure change.

Figure 6.
Energy perturbations correlate with Hfq induced structure changes in the sodB mRNA. A perturbation vector for the standard energy model was calculated to fit the experimentally established structure model by Geissmann et al. (35). The dotplot and color ...

DISCUSSION

The combination of thermodynamic folding and structure probing experiments is currently the standard method to establish secondary structure models. Probing experiments have seen rapid development over the past years leading to probing data for the complete HIV genome (18) and pilot studies of transcriptome wide probing in yeast (9) and mouse (10). Scaling the problem from individual RNAs to genome-wide data is not only an experimental challenge. The computational analysis of probing experiments to automatically generate reliable structure models seems equally challenging. There are many steps involved and sophisticated methods to pre-process the data that have been developed (8,10). Here, we addressed the last step in this process, the actual folding step.

We proposed a novel way to incorporate experimental constraints into classical thermodynamic folding. Hard combinatorial constraints that have been used for long time only make sense when a model is manually built for an individual RNA, but does not scale to automatic structure prediction from noisy data. Therefore, we introduced a ‘soft constraint’ approach that is based on pseudo-energies that favor individual positions to be paired or unpaired. We formulated the problem using the partition function, which offers the most flexible description of the thermodynamics properties of an RNA and allows for example to calculate pair probabilities or study suboptimal structures (19). Since previous pseudo-energy approaches cannot be easily applied in that case (see ‘Rationale’ section), we introduced a formal framework to reconcile external constraints and thermodynamic predictions. In this framework, pseudo-energies have an interpretable meaning and the system shows some important properties such as the simple fact that in the case of experiment and thermodynamic model being in perfect agreement no pseudo-energies are applied. However, an iterative algorithm is required in practice to find the optimal pseudo-energies. We derived an analytic expression for the gradient of this optimization problem which allows for effective minimization.

We tested our method on a SHAPE data set of rRNAs that has been used for benchmarking previously. It provides of ~4000 probed positions (17) allowing for statistically relevant comparisons between methods. Unfortunately, similarly sized data sets are not available for other RNAs and it remains to be determined how our results generalize across various other classes of RNAs.

On the rRNA data set, we found that our soft constraint approach with SHAPE data clearly improves structure prediction compared with normal thermodynamic folding. Varying the weight of the probing data used for the prediction identifies a maximum in accuracy, which, however, stays well below the best value theoretically possible with perfect data (Figure 2). Although our algorithm performs well in this particular benchmark, it could not clearly outperform for example the much simpler method by Deigan et al. An important observation is that the difference between the observed and theoretically possible performance is much larger than the differences between the various methods. This suggests that substantial improvements cannot be achieved by improving the folding algorithm in a generic way but rather through more efficient noise filtering and pre-processing of the raw data from the various experimental protocols. Although there is a clear correlation of SHAPE reactivities and pair probabilities, it is not straightforward to find a simple model to describe this relationship. The SHAPE reactivity measuring the local flexibility of a nucleotide seems to be dependent on the structural context, i.e. the type of loop (hairpin, bulge) and the position within the loop. It is also influenced by tertiary interactions. Systematic studies with different classes of RNAs will be necessary to understand this signal and the associated noise in more detail. Here, we used a simple tresholding method to convert the SHAPE reactivities into discrete states (paired or unpaired) as input for our algorithm.

We also studied the behavior of our algorithm on individual examples and found that it is capable of recovering the correct thermodynamic ensemble of a bistable RNA (29), identify the critical positions of nucleotide modifications required for correct in vivo folding of human mitochondrial tRNA-Lys (32) and the region of Hfq-induced structure changes in sodB mRNA (35). These applications demonstrate the usefulness of our ‘soft constrained’ partition function approach beyond pure structure prediction.

Finally, we have formulated the problem in a generic form such that the methodology presented in this article is not limited to classical chemical or enzymatic probing data for individual positions. A new experimental procedure has been proposed that provides information on particular base pairs on a short model RNA by extending classical probing with systematic mutation strategies (36). Our algorithm can be extended to any structural element for which the probability can be calculated from the partition function, including specific base pairs which would allow one to analyze also the type of experiments presented by Kladwang et al. (36).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Figure S1.

FUNDING

Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (Schrödinger Fellowship J2966-B12 to S.W.). Deutsche Forschungsgemeinschaft as part of the priority program “Sensory and Regulatory RNAs in Prokaryotes” (SPP 1258, to P.F.S.). Funding for open access charge: Massachusetts Institute of Technology.

Conflict of interest statement. None declared.

Supplementary Material

Supplementary Data:

ACKNOWLEDGEMENTS

We thank Michael Kertesz, Eran Segal and Howard Chang for helpful discussions, David Mathews for providing SHAPE data, Ronny Lorenz and Stephan Bernhart for help with the Vienna RNA package.

REFERENCES

1. Amaral P, Dinger M, Mercer T, Mattick J. The eukaryotic genome as an RNA machine. Science. 2008;319:1787–1789. [PubMed]
2. Ehresmann C, Baudin F, Mougel M, Romby P, Ebel J, Ehresmann B. Probing the structure of RNAs in solution. Nucleic Acids Res. 1987;15:9109–9128. [PMC free article] [PubMed]
3. Knapp G. Enzymatic approaches to probing of RNA secondary and tertiary structure. Methods Enzymol. 1989;180:192–212. [PubMed]
4. Weeks K. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 2010;20:295–304. [PMC free article] [PubMed]
5. Merino EJ, Wilkinson KA, Coughlan JL, Weeks KM. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE) J. Am. Chem. Soc. 2005;127:4223–4231. [PubMed]
6. Mitra S, Shcherbakova IV, Altman RB, Brenowitz M, Laederach A. High-throughput single-nucleotide structural mapping by capillary automated footprinting analysis. Nucleic Acids Res. 2008;36:e63. [PMC free article] [PubMed]
7. Lucks J, Mortimer S, Trapnell C, Luo S, Aviran S, Schroth G, Pachter L, Doudna J, Arkin A. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (shape-seq) Proc. Natl Acad. Sci. USA. 2011;108:11069–11074. [PubMed]
8. Aviran S, Trapnell C, Lucks J, Mortimer S, Luo S, Schroth G, Doudna J, Arkin A, Pachter L. Modeling and automation of sequencing-based characterization of RNA structure. Proc. Natl Acad. Sci. USA. 2011;108:11063–11068. [PubMed]
9. Kertesz M, Wan Y, Mazor E, Rinn J, Nutter R, Chang H, Segal E. Genome-wide measurement of RNA secondary structure in yeast. Nature. 2010;467:103–107. [PubMed]
10. Underwood J, Uzilov A, Katzman S, Onodera C, Mainzer J, Mathews D, Lowe T, Salama S, Haussler D. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods. 2010;7:995–1001. [PMC free article] [PubMed]
11. Quarrier S, Martin J, Davis-Neulander L, Beauregard A, Laederach A. Evaluation of the information content of RNA structure mapping data for secondary structure prediction. RNA. 2010;16:1108–1117. [PubMed]
12. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415. [PMC free article] [PubMed]
13. Reuter J, Mathews D. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 2010;11:129. [PMC free article] [PubMed]
14. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. 1994;125:167–188.
15. Mathews D, Sabina J, Zuker M, Turner D. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. [PubMed]
16. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA. 2004;101:7287–7292. [PubMed]
17. Deigan KE, Li TW, Mathews DH, Weeks KM. Accurate SHAPE-directed RNA structure determination. Proc. Natl Acad. Sci. USA. 2009;106:97–102. [PubMed]
18. Watts JM, Dang KK, Gorelick RJ, Leonard CW, Bess JW, Jr, Swanstrom R, Burch CL, Weeks KM. Architecture and secondary structure of an entire HIV-1 RNA genome. Nature. 2009;460:711–716. [PMC free article] [PubMed]
19. McCaskill JS. The Equilibrium Partition Function and Base Pair Binding Probabilities for RNA Secondary Structure. Biopolymers. 1990;29:1105–1119. [PubMed]
20. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 2003;31:7280–7301. [PMC free article] [PubMed]
21. Lu Z, Gloor J, Mathews D. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA. 2009;15:1805–13. [PubMed]
22. Do C, Woods D, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006;22:e90–8. Jul. [PubMed]
23. Gardner P, Giegerich R. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004;5:140. [PMC free article] [PubMed]
24. Vasa SM, Guex N, Wilkinson KA, Weeks KM, Giddings MC. ShapeFinder: a software system for high-throughput quantitative analysis of nucleic acid reactivity information resolved by capillary electrophoresis. RNA. 2008;14:1979–1990. [PubMed]
25. Garst AD, Edwards AL, Batey RT. Riboswitches: structures and mechanisms. Cold Spring Harb. Perspect. Biol. 2011;3:a003533. [PMC free article] [PubMed]
26. Schultes EA, Bartel DP. One Sequence, Two Ribozymes: Implications for the Emergence of New Ribozyme Folds. Science. 2000;289:448–452. [PubMed]
27. Zhuang X, Kim H, Pereira MJB, Babcock HP, Walter NGW, Chu S. Correlating Structural Dynamics and Function in Single Ribozyme Molecules. Science. 2002;296:1473–1476. [PubMed]
28. Huang Z, Pei W, Han Y, Jayaseelan S, Shekhtman A, Shi H, Niu L. One RNA aptamer sequence, two structures: a collaborating pair that inhibits AMPA receptors. Nucleic Acids Res. 2009;37:4022–4032. [PMC free article] [PubMed]
29. Waldminghaus T, Kortmann J, Gesing S, Narberhaus F. Generation of synthetic RNA-based thermosensors. Biol. Chem. 2008;389:1319–1326. [PubMed]
30. Helm M. Post-transcriptional nucleotide modification and alternative folding of RNA. Nucleic Acids Res. 2006;34: 721–733. [PMC free article] [PubMed]
31. Motorin Y, Helm M. tRNA stabilization by modified nucleotides. Biochemistry. 2010;49:4934–4944. [PubMed]
32. Helm M, Brulé H, Degoul F, Cepanec C, Leroux J, Giegé R, Florentz C. The presence of modified nucleotides is required for cloverleaf folding of a human mitochondrial tRNA. Nucleic Acids Res. 1998;26:1636–1643. [PMC free article] [PubMed]
33. Czerwoniec A, Dunin-Horkawicz S, Purta E, Kaminska K, Kasprzak J, Bujnicki J, Grosjean H, Rother K. MODOMICS: a database of RNA modification pathways 2008 update. Nucleic Acids Res. 2009;37:D118–D121. [PMC free article] [PubMed]
34. Mayer O, Windbichler N, Wank H, Schroeder R. Protein-Induced RNA Switches in Nature. Eurekah Bioscience. 2005;1:177–184.
35. Geissmann T, Touati D. Hfq, a new chaperoning role: binding to messenger RNA determines access for small RNA regulator. EMBO J. 2004;23:396–405. [PubMed]
36. Kladwang W, Cordero P, Das R. A mutate-and-map strategy accurately infers the base pairs of a 35-nucleotide model RNA. RNA. 2011;17:522–34. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press