PLoS One. 2012; 7(10): e45160.
Published online 2012 October 16. doi:  10.1371/journal.pone.0045160
PMCID: PMC3473038

Integrating Chemical Footprinting Data into RNA Secondary Structure Prediction

Cynthia Gibas, Editor

Abstract

Chemical and enzymatic footprinting experiments, such as SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension), yield important information about RNA secondary structure. Indeed, since the 2′-hydroxyl is reactive at flexible (loop) regions, but unreactive at base-paired regions, SHAPE yields quantitative data about which RNA nucleotides are base-paired. Recently, low error rates in secondary structure prediction have been reported for three RNAs of moderate size, by including base stacking pseudo-energy terms derived from SHAPE data into the computation of minimum free energy secondary structure. Here, we describe a novel method, RNAsc (RNA soft constraints), which includes pseudo-energy terms for each nucleotide position, rather than only for base stacking positions. We prove that RNAsc is self-consistent, in the sense that the nucleotide-specific probabilities of being unpaired in the low energy Boltzmann ensemble always become more closely correlated with the input SHAPE data after application of RNAsc. From this mathematical perspective, the secondary structure predicted by RNAsc should be ‘correct’, in as much as the SHAPE data is ‘correct’. We benchmark RNAsc against the previously mentioned method for eight RNAs, for which both SHAPE data and native structures are known, to find the same accuracy in 7 out of 8 cases, and an improvement of 25% in one case. Furthermore, we present what appears to be the first direct comparison of SHAPE data and in-line probing data, by comparing yeast asp-tRNA SHAPE data from the literature with data from in-line probing experiments we have recently performed. With respect to several criteria, we find that SHAPE data appear to be more robust than in-line probing data, at least in the case of asp-tRNA.

Introduction

RNA is an important biomolecule, known to play both an information carrying and a catalytic role. RNA plays roles in numerous biological processes, including retranslation of the genetic code (selenocysteine insertion, ribosomal frameshift), transcriptional and translational gene regulation, temperature-dependent allosteric regulation, chemical modification of specific nucleotides in the ribosome, regulation of alternative splicing, apparent regulation of the formation of heterochromatin, etc. (See [1] for a recent review on the analysis of sequence and structure of such noncoding RNA.) Since the function of non-coding RNA largely depends on its structure and since it is believed that RNA plays many yet undiscovered roles in cellular processes, it is important to determine the structure of RNA.

A secondary structure for a given RNA nucleotide sequence $a_1,\ldots,a_n$ is a set $S$ of base pairs $(i,j)$, such that $a_i, a_j$ forms either a Watson-Crick or GU (wobble) base pair, and such that there are no base triples or pseudoknots in $S$. In this context, a base triple in $S$ consists of two base pairs $(i,j)$, $(i,k)$ or $(i,j)$, $(k,j)$. A pseudoknot in $S$ consists of two base pairs $(i,j)$, $(k,\ell)$ with $i<k<j<\ell$. Although it is NP-hard [2] to compute the minimum free energy (MFE) tertiary (or even pseudoknotted) structure of RNA [3], the MFE secondary structure can be computed in time that is cubic in the input sequence length [4]. Moreover, it is widely believed that RNA folds in a hierarchical fashion [5]–[8], with the secondary structure acting as a scaffold for tertiary structure, although this is not universally accepted [9].
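
The definition above can be checked mechanically. The following is a minimal sketch, assuming 1-based positions and a hypothetical helper name `is_secondary_structure` (not taken from the paper). It tests only the three conditions stated in the text: canonical or wobble pairing, no position shared by two base pairs (base triple), and no crossing base pairs (pseudoknot). It deliberately ignores any minimum hairpin loop size.

```python
# Watson-Crick and GU wobble pairs permitted in a secondary structure.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_secondary_structure(seq, pairs):
    """Return True if `pairs` (1-based (i, j) tuples) is a valid secondary structure of `seq`."""
    ordered = sorted(pairs)
    seen = set()
    for i, j in ordered:
        if not 1 <= i < j <= len(seq):
            return False
        if (seq[i - 1], seq[j - 1]) not in CANONICAL:
            return False
        if i in seen or j in seen:
            return False  # a position in two base pairs: base triple
        seen.update((i, j))
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if i < k < j < l:
                return False  # crossing base pairs: pseudoknot
    return True


print(is_secondary_structure("GGGAAAUCC", [(1, 9), (2, 8), (3, 7)]))  # True (nested helix)
print(is_secondary_structure("GACU", [(1, 3), (2, 4)]))              # False (crossing pairs)
```

The quadratic crossing check is chosen for readability; a stack-based scan would do the same test in linear time.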

RNA secondary structure can be predicted by Zuker and Stiegler's algorithm [4], implemented in mfold [10], RNAfold [11], and RNAstructure [12]. This algorithm uses dynamic programming with free energy parameters from the Turner energy model [13] to compute the minimum free energy (MFE) structure.

A first step towards integrating chemical/enzymatic probing data was taken by Mathews et al. [14], where Zuker and Stiegler's algorithm was modified to support hard constraints reflecting the experimental data. In particular, given an RNA sequence, the software RNAstructure [14] computed the minimum free energy (MFE) secondary structure subject to user-defined constraints, such as stipulating that particular nucleotides remain unpaired, that pairs of specific nucleotides form a base pair, etc. Mathews et al. reported that the MFE structure prediction with (hard) constraints corresponding to chemical modification (1-cyclohexyl-3-(2-morpholinoethyl) carbodiimide metho-p-toluene sulfonate, dimethyl sulfate, and kethoxal) yielded an improvement in base-pair accuracy for 5S rRNA of E. coli from 26.3% to 86.8% [14]. (See [15] for more remarks and a less optimistic evaluation of RNAstructure with hard constraints on 16S rRNA.)

Chemical/enzymatic probing data is probabilistic in nature, as exemplified in PARS footprinting data [16]. Rarely is it absolutely clear that certain positions are unpaired, or that certain base pairs are formed; instead, there is a certain probability of these events. In moving away from error-prone hard constraints, Deigan et al. [15] took a second step by incorporating SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension) data [17], [18], whose numerical values (continuously) range from 0 to approximately 2.2, as a pseudo free energy for base stacking in the Zuker algorithm. The pseudo free energy term in [15] was defined to be

$$\Delta G_{\mathrm{SHAPE}}(i) = m \ln\bigl(\mathrm{SHAPE\ reactivity}(i) + 1\bigr) + b \tag{1}$$

where $m = 2.6$ kcal/mol and $b = -0.8$ kcal/mol, for each position $i$ occurring in a base pairing stack; if $i$ is unpaired, then no pseudo free energy is added. (The position $i$ is in a base pairing stack if $(i,j)$, $(i+1,j-1)$ are base pairs, or if $(i,j)$, $(i-1,j+1)$ are base pairs belonging to the secondary structure. For base pairs $(i,j)$ that are surrounded by base pair neighbors $(i-1,j+1)$ and $(i+1,j-1)$, the pseudo-energy term is applied twice.) The resulting modified version of Zuker and Stiegler's algorithm, as implemented in RNAstructure, was reported to yield secondary structure prediction accuracies of up to 96–100% for three moderate-sized RNAs ($<200$ nt) and for 16S rRNA ($\approx 1500$ nt). Wilkinson et al. [19] later described a model for the secondary structure of the HIV-1 genome, as computed by RNAstructure with SHAPE pseudo energies defined in equation 1. If correct, this is a remarkable feat, given that the size of the HIV-1 genome is generally just under 10,000 nt (see http://www.hiv.lanl.gov), hence several times larger than the ribosome, whose crystal structure was only determined after years of painstaking work (the large subunit, PDB code 1FFK [20], of the ribosome of Haloarcula marismortui consists of a 23S chain of length 2,922 nt and a 5S chain of 122 nt).
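
To make the stacking pseudo energy of [15] concrete, the sketch below evaluates the log-linear form commonly cited from that work, m·ln(reactivity + 1) + b, with slope 2.6 kcal/mol and intercept −0.8 kcal/mol used as assumed defaults. The function name is hypothetical, and the code computes only the per-nucleotide term; it does not model where RNAstructure applies it within a stack.

```python
import math


def shape_pseudo_energy(reactivity, m=2.6, b=-0.8):
    """Pseudo free energy (kcal/mol) for one stacked nucleotide.

    Assumes the log-linear form m * ln(reactivity + 1) + b and reactivity > -1.
    """
    return m * math.log(reactivity + 1.0) + b


for r in (0.0, 0.3, 0.7, 2.2):
    print(f"reactivity {r:.1f} -> {shape_pseudo_energy(r):+.2f} kcal/mol")
```

Under these assumed parameters the term is a bonus (negative) for reactivities below roughly 0.36 and a penalty above that, which is why it rewards stacking at unreactive positions but says nothing about positions left unpaired.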

One issue with this approach is that it takes SHAPE data into consideration only for base-stacked positions; i.e., a pseudo free energy term corresponding to SHAPE data is applied at positions where a stacked base pair occurs, but not where nucleotides are unpaired. By ignoring SHAPE data for unpaired nucleotide positions, this approach can bias structure prediction to form base pairs even at positions that the SHAPE data suggest are flexible. Indeed, the expected distance between the base pairing probabilities computed by RNAstructure and the SHAPE values increases after the incorporation of the SHAPE pseudo energy terms (see Table 1). (As later defined, RNAstructure and RNAsc both compute the probability [image: pone.0045160.e031.jpg] that base pair [image: pone.0045160.e032.jpg] belongs to a structure in the low energy Boltzmann ensemble. Since the pseudo energy model for SHAPE data incorporation is different in RNAstructure and RNAsc, the base pairing probabilities and Boltzmann low energy ensembles may be different.) In contrast to the pseudo energies of RNAstructure, our algorithm, RNAsc, always shifts the distribution of conformations towards the SHAPE measurements (see Methods for a mathematical proof).

Table 1
Benchmark results.

Nonetheless, MFE dynamic programming methods that incorporate high throughput chemical/enzymatic footprinting data can yield important insights into the structure and function of RNA molecules, much faster than the labor-intensive X-ray diffraction methods.

The motivation for our work is to develop a method that incorporates chemical/enzymatic footprinting data in a self-consistent manner. In particular, given experimental data of the form [image: pone.0045160.e035.jpg], where [image: pone.0045160.e036.jpg] is the experimental probability that the [image: pone.0045160.e037.jpg]th nucleotide is unpaired (or, more accurately, in a flexible region, as witnessed by high SHAPE reactivity), our goal is to develop an algorithm incorporating footprinting data such that the recalculated probabilities [image: pone.0045160.e038.jpg] are guaranteed to be closer to the experimental measurements. If our algorithm is self-consistent in this manner, then we have strong mathematical evidence that the partition function computation, and hence the MFE computation, are both as correct as the SHAPE data. In contrast to the pseudo energies of RNAstructure, we prove that our algorithm RNAsc is self-consistent, and, on average, the ensemble of low energy secondary structures produced by our method yields a footprinting pattern that closely resembles the pattern from input experimental SHAPE data. We benchmark our method against the RNAstructure program [19] on eight RNAs, for which SHAPE data and native structures are both available. The secondary structure predictions from our method and from RNAstructure are fairly similar, and both significantly improve on secondary structure prediction without footprinting data (e.g., mfold, RNAfold). However, the expected distance between the computed probabilities and the SHAPE data is lower with our method for all the test cases. It is worth noting that the mistakes in the predicted secondary structure usually occur in positions where the SHAPE data might be inaccurate, or where the native structure and the structure probed by SHAPE could be somewhat different, due to the quite different temperatures required by each experimental protocol. Recent studies have shown that different experimental mapping approaches can provide complementary structural information [21]. Thus, we additionally performed in-line probing [22], [23] on asp-tRNA, in order to compare the results of SHAPE and in-line probing in the context of our algorithm. The source code of RNAsc as well as a web server is available at http://bioinformatics.bc.edu/clotelab/RNAsc/.

Methods

In-line probing experiments

DNA oligonucleotides for the sequence and its reverse complement were purchased from MWG Operon; remaining reagents were obtained from Sigma-Aldrich. DNA oligonucleotides were annealed to create templates for T7 polymerase transcription, and the transcription products were purified by denaturing PAGE and eluted in 10 mM Tris-HCl (pH 7.5 at [image: pone.0045160.e039.jpg]C), 200 mM NaCl and 1 mM EDTA. Following in-line probing protocols designed by the Breaker Lab [22], [23], synthesized RNA molecules were dephosphorylated using alkaline phosphatase (Roche Diagnostics) and radiolabeled with [γ-32P]ATP and T4 polynucleotide kinase (NEB) according to the manufacturers' instructions. Spontaneous transesterification reactions using PAGE-purified, [image: pone.0045160.e040.jpg] end-labeled RNAs were assembled as described in [23]. Incubations were performed for approximately 40 h at [image: pone.0045160.e041.jpg]C in 10-μL volumes containing 50 mM Tris-HCl (pH 8.3 at [image: pone.0045160.e042.jpg]C), 20 mM MgCl2, 100 mM KCl and [image: pone.0045160.e043.jpg] nM RNA. RNA fragments resulting from spontaneous transesterification were resolved by denaturing 10% PAGE, and imaged with a Molecular Dynamics STORM PhosphorImager. Quantification of gels was performed using SAFA (Semi-Automated Footprinting Analysis) [24]. In-line probing experiments were repeated two additional times, resulting in gels with comparable data (data not shown). Fig. 1 is an image of the in-line probing gel for yeast asp-tRNA.

Figure 1
In-line probing.

Computational methods

Briefly stated, our algorithm, RNAsc (RNA soft constraints), consists of a preprocessing step that normalizes SHAPE data to the range $[0,1]$, followed by a computation of the minimum free energy [resp. partition function], which incorporates pseudo-energy terms [resp. Boltzmann factors of pseudo-energy terms] for each nucleotide position. We begin by discussing the normalization of SHAPE data.

Normalization of SHAPE

In experiments reported by the Weeks Lab [25] as well as the Das Lab [26], SHAPE reactivities range from [image: pone.0045160.e048.jpg] to roughly [image: pone.0045160.e049.jpg]. Large reactivities suggest that the position is unpaired; small reactivities suggest that the position is base-paired. More specifically, nucleotides with SHAPE reactivities [image: pone.0045160.e050.jpg] or [image: pone.0045160.e051.jpg] are considered highly and moderately reactive, respectively [15]. The normalization is carried out in a piecewise linear fashion, where [image: pone.0045160.e052.jpg] is roughly mapped to [image: pone.0045160.e053.jpg]. However, very low SHAPE reactivities should not be mapped close to [image: pone.0045160.e054.jpg] either, as this would bias the SHAPE values toward unpaired nucleotides. For this reason, the SHAPE reactivity values [image: pone.0045160.e055.jpg] are linearly mapped to the interval [image: pone.0045160.e056.jpg], the reactivity values in [image: pone.0045160.e057.jpg] are linearly mapped to the interval [image: pone.0045160.e058.jpg], the reactivity values in [image: pone.0045160.e059.jpg] are linearly mapped to the interval [image: pone.0045160.e060.jpg], and lastly, the reactivities [image: pone.0045160.e061.jpg] are linearly mapped to the interval [image: pone.0045160.e062.jpg]. The selection of the threshold values is motivated by the moderate and high reactivity thresholds reported in [15] and by examination of the cumulative distribution of the SHAPE data (see File S1). The in-line probing data were normalized by mapping the outliers at the [image: pone.0045160.e063.jpg] and the [image: pone.0045160.e064.jpg] quantiles to [image: pone.0045160.e065.jpg] and [image: pone.0045160.e066.jpg], respectively, and linearly normalizing the rest of the data to [image: pone.0045160.e067.jpg]. Fig. 2 shows a plot of the normalized and raw SHAPE values as well as the normalization map.
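
The two normalization schemes described above can be sketched as follows. This is an illustration under stated assumptions, not the RNAsc implementation: the function names are hypothetical, the knots in `EXAMPLE_KNOTS` are placeholders chosen only to show the mechanics (they are not the thresholds used by RNAsc), and the quantile scheme assumes that the low and high outliers are sent to 0 and 1.

```python
def piecewise_linear(value, knots):
    """Map `value` through (raw, normalized) knots sorted by raw; clamp outside the knot range."""
    if value <= knots[0][0]:
        return knots[0][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    return knots[-1][1]


def quantile(values, p):
    """Quantile of `values` with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def clip_and_scale(values, low_p, high_p):
    """Send values at or beyond the two quantiles to 0 and 1; scale the rest linearly.

    Assumes the two quantiles differ, so the denominator is nonzero.
    """
    low, high = quantile(values, low_p), quantile(values, high_p)
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]


# Illustrative knots only: NOT the thresholds used by RNAsc.
EXAMPLE_KNOTS = [(0.0, 0.0), (0.3, 0.5), (0.7, 0.8), (2.0, 1.0)]

raw_shape = [0.0, 0.15, 0.5, 1.2, 3.0]
print([round(piecewise_linear(r, EXAMPLE_KNOTS), 3) for r in raw_shape])
print(clip_and_scale(list(range(11)), 0.1, 0.9))
```

Representing the map as a list of knots keeps the thresholds in one place, so different breakpoints can be tried without touching the interpolation code.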

Figure 2
Normalization.

Boltzmann weights

Let [image: pone.0045160.e068.jpg] be a fixed RNA sequence of length [image: pone.0045160.e069.jpg], for which we are given normalized SHAPE or in-line probing reactivity data [image: pone.0045160.e070.jpg], where [image: pone.0045160.e071.jpg]. For [image: pone.0045160.e072.jpg] and [image: pone.0045160.e073.jpg], define the Boltzmann weight

equation image
(2)

where [image: pone.0045160.e075.jpg] is a scaling parameter, and [image: pone.0045160.e076.jpg] measures the discrepancy between [image: pone.0045160.e077.jpg] and [image: pone.0045160.e078.jpg]. We will later incorporate Boltzmann weights in a weighted partition function [image: pone.0045160.e079.jpg], in a manner that reweights the ensemble of low energy conformations towards the SHAPE data. When later used in recurrence relations for [image: pone.0045160.e080.jpg], the variable [image: pone.0045160.e081.jpg] is the indicator function for whether a position is unpaired [image: pone.0045160.e082.jpg] or paired [image: pone.0045160.e083.jpg] in a secondary structure under consideration. In the case of missing values, [image: pone.0045160.e084.jpg] may be assigned the value [image: pone.0045160.e085.jpg], which represents no information about base pairing.
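
As a rough illustration of such a weight, the sketch below assumes a specific functional form that is not confirmed by the text: the exponential of a scaled absolute discrepancy between the pairing indicator and the normalized reactivity, divided by RT. The function name, the default scale of 1.0 kcal/mol, the temperature, and the convention that the indicator is 1 for unpaired and 0 for paired are all assumptions.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def boltzmann_weight(q_i, x, scale=1.0, temperature=310.15):
    """Assumed weight exp(-scale * |x - q_i| / RT).

    q_i: normalized reactivity in [0, 1], read as a probability of being unpaired.
    x:   1 if the position is unpaired in the structure, 0 if it is paired.
    """
    return math.exp(-scale * abs(x - q_i) / (GAS_CONSTANT * temperature))


for q in (0.1, 0.5, 0.9):
    print(q, boltzmann_weight(q, 1), boltzmann_weight(q, 0))
```

Under this assumed form, a reactivity of 0.5 gives the same weight to the paired and unpaired states, so it carries no information about base pairing, while a reactivity near 1 favors structures in which the position is unpaired.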

Weighting the partition function

In this section, we describe how to integrate Boltzmann weights into the computation of the partition function for secondary structures of a given RNA sequence. This allows us to compute the probability [image: pone.0045160.e086.jpg] [resp. [image: pone.0045160.e087.jpg]] that [image: pone.0045160.e088.jpg] is a base pair in the Boltzmann ensemble of structures, where weights for SHAPE or in-line probing have not [resp. have] been taken into consideration. As later explained, we will compare the probability [image: pone.0045160.e089.jpg] with normalized SHAPE reactivity [image: pone.0045160.e090.jpg]. Let [image: pone.0045160.e091.jpg] denote the subsequence [image: pone.0045160.e092.jpg] of a given, fixed RNA sequence [image: pone.0045160.e093.jpg] of length [image: pone.0045160.e094.jpg]. For [image: pone.0045160.e095.jpg], the McCaskill [27] partition function [image: pone.0045160.e096.jpg] is defined by [image: pone.0045160.e097.jpg], where the sum is taken over all secondary structures [image: pone.0045160.e098.jpg] of [image: pone.0045160.e099.jpg], [image: pone.0045160.e100.jpg] is the free energy of [image: pone.0045160.e101.jpg] with respect to the Turner energy model [13], [28], [image: pone.0045160.e102.jpg] [image: pone.0045160.e103.jpg] is the universal gas constant, and [image: pone.0045160.e104.jpg] is the absolute temperature. The goal of the current paper is to integrate the previously defined weights into the partition function. We first require some notation. Here, we write [image: pone.0045160.e105.jpg], etc. instead of the more cumbersome notation [image: pone.0045160.e106.jpg], etc. Thus [image: pone.0045160.e107.jpg] etc. depend on the normalized footprinting data [image: pone.0045160.e108.jpg], although [image: pone.0045160.e109.jpg] will not be explicitly mentioned.

Definition 1 (Weighted partition function)

Define

  • [image: pone.0045160.e110.jpg]: weighted partition function over all secondary structures of [image: pone.0045160.e111.jpg].
  • [image: pone.0045160.e112.jpg]: weighted partition function over all secondary structures of [image: pone.0045160.e113.jpg], which contain the base pair [image: pone.0045160.e114.jpg].
  • [image: pone.0045160.e115.jpg]: weighted partition function over all secondary structures of [image: pone.0045160.e116.jpg], subject to the constraint that [image: pone.0045160.e117.jpg] is part of a multiloop and has at least one component.
  • [image: pone.0045160.e118.jpg]: weighted partition function over all secondary structures of [image: pone.0045160.e119.jpg], subject to the constraint that [image: pone.0045160.e120.jpg] is part of a multiloop and has exactly one component. Moreover, it is required that [image: pone.0045160.e121.jpg] base-pair in the interval [image: pone.0045160.e122.jpg]; i.e. [image: pone.0045160.e123.jpg] is a base pair, for some [image: pone.0045160.e124.jpg].

To compute the partition function [image: pone.0045160.e125.jpg], we compute by dynamic programming [image: pone.0045160.e126.jpg] for all [image: pone.0045160.e127.jpg] by increasing values of [image: pone.0045160.e128.jpg]. Structures on [image: pone.0045160.e129.jpg] can be subdivided into those for which [image: pone.0045160.e130.jpg] is unpaired in [image: pone.0045160.e131.jpg], thus contributing [image: pone.0045160.e132.jpg] times the Boltzmann factor for [image: pone.0045160.e133.jpg] to be unpaired, and those for which [image: pone.0045160.e134.jpg] is paired with [image: pone.0045160.e135.jpg] for [image: pone.0045160.e136.jpg], thus contributing [image: pone.0045160.e137.jpg] times the Boltzmann factor for [image: pone.0045160.e138.jpg] to be paired. Subsequently, [image: pone.0045160.e139.jpg] is computed by adding a contribution for all loops closed by base pair [image: pone.0045160.e140.jpg], i.e., hairpins, bulges, internal loops and multiloops, where the multiloop contribution is recursively computed by the multiloop partition functions [image: pone.0045160.e141.jpg] and [image: pone.0045160.e142.jpg]. In essence, we apply Boltzmann weights to each nucleotide position [image: pone.0045160.e143.jpg], while accounting for a distinct weight depending on whether [image: pone.0045160.e144.jpg] is paired or unpaired in the structure [image: pone.0045160.e145.jpg] under consideration: weight [image: pone.0045160.e146.jpg] if [image: pone.0045160.e147.jpg] is unpaired in [image: pone.0045160.e148.jpg], weight [image: pone.0045160.e149.jpg] if [image: pone.0045160.e150.jpg] is base-paired in [image: pone.0045160.e151.jpg]. If all weights were set to $1$, then the weighted partition function would be equivalent to the classic partition function. Similar forms of rearranging and reweighting of the partition function have been applied in the context of single stranded RNA binding proteins [29]. Details now follow. It will be expedient to define the function [image: pone.0045160.e153.jpg], which represents the weight corresponding to a loop region in which [image: pone.0045160.e154.jpg] are unpaired. For [image: pone.0045160.e155.jpg], [image: pone.0045160.e156.jpg], while for [image: pone.0045160.e157.jpg],

equation image
(3)
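
The per-position reweighting described above can be illustrated on a deliberately simplified model. The sketch below is not the RNAsc recursion: it assumes that the loop weight is the product of the unpaired weights over the stretch (and 1 for an empty stretch), it replaces the Turner energy model by a toy energy of `pair_energy` kcal/mol per base pair, and it uses a single table instead of separate multiloop tables. All names, the minimum hairpin size of 3, and the 0-based indexing are assumptions.

```python
import math
from functools import lru_cache

RT = 0.0019872 * 310.15  # kcal/mol at 37 degrees C
THETA = 3                # assumed minimum number of unpaired bases in a hairpin loop
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def loop_weight(w_unpaired, i, j):
    """Product of unpaired weights over positions i..j; 1 for an empty stretch (i > j)."""
    weight = 1.0
    for k in range(i, j + 1):
        weight *= w_unpaired[k]
    return weight


def weighted_partition(seq, w_unpaired, w_paired, pair_energy=-1.0):
    """Toy weighted partition function with per-position paired/unpaired weights."""
    pair_factor = math.exp(-pair_energy / RT)

    @lru_cache(maxsize=None)
    def z(i, j):
        if j - i <= THETA:                       # too short to hold a base pair
            return loop_weight(w_unpaired, i, j)
        total = z(i, j - 1) * w_unpaired[j]      # case: position j is unpaired
        for k in range(i, j - THETA):            # case: position j pairs with k
            if (seq[k], seq[j]) in CANONICAL:
                total += (z(i, k - 1) * z(k + 1, j - 1) * pair_factor
                          * w_paired[k] * w_paired[j])
        return total

    return z(0, len(seq) - 1)


seq = "GGGAAAUCC"
ones = [1.0] * len(seq)
print(weighted_partition(seq, ones, ones))                  # classic (unweighted) value
print(weighted_partition(seq, [0.5] * len(seq), ones))      # penalizing unpaired positions
```

With every weight equal to 1 the function reduces to the unweighted partition function of the toy model, which mirrors the remark in the text that unit weights recover the classic partition function.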

In the base case, we define [image: pone.0045160.e159.jpg] and [image: pone.0045160.e160.jpg] for all [image: pone.0045160.e161.jpg], where [image: pone.0045160.e162.jpg] is the minimum number of unpaired bases in a hairpin loop (generally [image: pone.0045160.e163.jpg]). In the inductive case, where [image: pone.0045160.e164.jpg], we define

equation image
(4)

Note that in the above equation An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e166.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e167.jpg correspond to the weights for the nucleotides An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e168.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e169.jpg being paired, but not necessarily to one another. If extra information on the pairing status of the nucleotides is available (e.g., as in ‘mutate and map’ experiments [30]), these weights may be corrected accordingly to reflect the weight for the pairing of the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e170.jpgth and the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e171.jpgth nucleotides. Let An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e172.jpg denote the free energy of a hairpin and let An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e173.jpg denote the free energy of an internal loop (which combines the cases of stacked base pair, bulge and proper internal loop). The free energy for a multiloop containing An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e174.jpg base pairs and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e175.jpg unpaired bases is given by the affine approximation An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e176.jpg. The weighted partition function closed by base pair An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e177.jpg is given by

equation image
(5)

The weighted multiloop partition function with a single component and where position An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e179.jpg is required to base-pair in the interval An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e180.jpg is given by

equation image
(6)

Finally, the weighted multiloop partition function with one or more components, having no requirement that position An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e182.jpg base-pair in the interval An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e183.jpg is given by

equation image
(7)

The weighted Boltzmann probability of base pair An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e185.jpg is defined by

equation image
(8)

where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e187.jpg – see Methods. Following Zuker [31], the inner and outer partition functions are computed, from which we easily obtain An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e188.jpg.

The minimum free energy (MFE) structure can be computed by a modification of McCaskill's algorithm [27], where the weighted partition function is modified by replacing summations by minimizations, products by sums, and replacing the weights by An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e189.jpg. Although we did implement this algorithm, it does not include energy contributions for stacked, single-stranded nucleotides (dangles) or coaxial stacking, both of which are known to improve secondary structure prediction accuracy. For this reason, we modified the source code of RNAstructure, which implements dangles and coaxial stacking, for both the MFE and the partition function computations. See File S1 for details. As in [15], the value of the scaling parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e190.jpg is determined by a search to optimize positive predictive value and sensitivity.
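
To illustrate how per-nucleotide weights enter a partition function recursion, the following is a minimal sketch. It uses a simplified Nussinov-style energy model (one constant energy per base pair) rather than the loop-based energy model and recursions of this section, so it is not the RNAsc implementation. The function name `weighted_partition`, the constant `pair_energy`, and the default pairing rule are assumptions made for illustration only.

```python
import math

# Gas constant times temperature in kcal/mol at 37 degrees C (310.15 K).
RT = 0.0019872 * 310.15


def weighted_partition(seq, w_unpaired, w_paired, pair_energy=-1.0,
                       theta=3, can_pair=None):
    """Weighted partition function for a toy energy model.

    seq         : RNA sequence (string)
    w_unpaired  : w_unpaired[i] is the weight applied when position i is unpaired
    w_paired    : w_paired[i] is the weight applied when position i is paired
    pair_energy : hypothetical constant energy per base pair (kcal/mol)
    theta       : minimum number of unpaired bases in a hairpin loop
    """
    if can_pair is None:
        allowed = {"AU", "UA", "GC", "CG", "GU", "UG"}
        can_pair = lambda a, b: (a + b) in allowed

    n = len(seq)
    boltz = math.exp(-pair_energy / RT)
    # Z[i][j] is the weighted partition function of seq[i:j] (half-open);
    # the empty interval contributes 1.
    Z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            last = j - 1
            # Case 1: the last position of the interval is unpaired.
            total = Z[i][last] * w_unpaired[last]
            # Case 2: the last position pairs with k, leaving at least
            # theta unpaired bases between k and last.
            for k in range(i, last - theta):
                if can_pair(seq[k], seq[last]):
                    total += (Z[i][k] * w_paired[k] * w_paired[last]
                              * boltz * Z[k + 1][last])
            Z[i][j] = total
    return Z[0][n]
```

In this sketch, setting every weight to 1 recovers the unweighted partition function of the toy model, mirroring the remark above about the classic partition function.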

Measures of uncertainty in the predicted low-energy ensemble of conformations

Pointwise entropy and Morgan-Higgs structural diversity [32] were used as measures of uncertainty in the prediction of the secondary structure. The pointwise entropy is defined as follows. For each fixed An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e191.jpg in An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e192.jpg, define probability distribution An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e193.jpg on An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e194.jpg by setting An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e195.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e196.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e197.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e198.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e199.jpg. Pointwise entropy An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e200.jpg measures the variability in nucleotides found to be base-paired with An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e201.jpg in the Boltzmann ensemble of low energy structures. The pointwise entropy without the probing data is computed similarly using the probabilities An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e202.jpg. To reflect the nature of the probing data, we modified this definition as follows. Define the binary pointwise entropy at position An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e203.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e204.jpg. Binary entropy measures the uncertainty in the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e205.jpgth nucleotide being paired or unpaired, reflecting the signal detected by probing data. Similar computations were done with An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e206.jpg (the base pairing probabilities without the integration of the weights). The Morgan-Higgs structural diversity is defined by An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e207.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e208.jpg is defined by An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e209.jpg. Similar computations were done with An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e210.jpg.
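
The two entropy measures described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the base pairing probabilities are assumed to be given as a symmetric matrix `bpp`, the probability of being unpaired is taken to be one minus the sum of pairing probabilities, and the natural logarithm is used (the logarithm base is an assumption). The function names are hypothetical.

```python
import math


def pointwise_entropy(bpp, i):
    """Entropy of the distribution over pairing partners of position i.

    bpp[i][j] is the probability that i pairs with j; the diagonal is
    ignored, and the leftover probability mass is the unpaired event.
    """
    n = len(bpp)
    probs = [bpp[i][j] for j in range(n) if j != i]
    probs.append(max(0.0, 1.0 - sum(probs)))  # probability that i is unpaired
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def binary_entropy(bpp, i):
    """Entropy of the two-outcome event 'position i is paired or unpaired'."""
    n = len(bpp)
    q = 1.0 - sum(bpp[i][j] for j in range(n) if j != i)
    q = min(1.0, max(0.0, q))  # guard against rounding error
    return -sum(p * math.log(p) for p in (q, 1.0 - q) if p > 0.0)
```

A position that is always paired, but with two equally likely partners, has positive pointwise entropy and zero binary entropy, which is the distinction the binary variant is meant to capture.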

RNAsc is guaranteed to improve agreement with shape data

In this section, we show that on average, the ensemble of low energy secondary structures produced by our method yields a footprinting pattern that more closely resembles the pattern from input experimental shape data; in particular, we prove that the expected distance from (normalized) shape data for the ensemble of low energy structures (our algorithm) is strictly less than the expected distance from shape data for the Boltzmann ensemble of low energy structures (McCaskill's algorithm). First, we require some definitions. All secondary structures An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e211.jpg considered in this section will be tacitly assumed to be secondary structures of the RNA molecule An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e212.jpg. Each secondary structure An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e213.jpg can be assigned a binary sequence An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e214.jpg so that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e215.jpg if the nucleotide An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e216.jpg is paired and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e217.jpg otherwise. Given experimental shape data yielding probabilities An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e218.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e219.jpg is the probability that nucleotide An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e220.jpg is unpaired, the distance of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e221.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e222.jpg is defined by:

equation image
(9)

The shape weight of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e224.jpg is defined to be

equation image
(10)

The weighted partition function then becomes

equation image
(11)

The Boltzmann probability An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e227.jpg of secondary structure An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e228.jpg is defined by

equation image
(12)

and the weighted Boltzmann probability An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e230.jpg is defined by

equation image
(13)

Define the critical distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e232.jpg by

equation image
(14)

Note that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e234.jpg does not depend on any particular secondary structure An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e235.jpg, although it does depend on An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e236.jpg and of course the input RNA sequence An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e237.jpg. It follows from definitions that for any secondary structure An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e238.jpg,

equation image
(15)

and strict inequalities hold as well. Indeed, since the exponential function is increasing, we have An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e240.jpg if and only if

equation image

Multiplying each side by An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e242.jpg, the above inequality can be written as

equation image

from which (15) follows. Similarly,

equation image
(16)

Next, define the expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e245.jpg between An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e246.jpg, obtained by normalizing shape data, and the ensemble of low energy structures as follows:

equation image
(17)

Similarly, define the SHAPE weighted expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e248.jpg between An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e249.jpg and the ensemble of low energy structures by

equation image
(18)

Let An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e251.jpg represent the sorted distances An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e252.jpg between all secondary structures of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e253.jpg, for given normalized shape data An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e254.jpg. Here An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e255.jpg denotes the total number of secondary structures. Note that there may be many distinct secondary structures that have a given distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e256.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e257.jpg; i.e. possibly many distinct An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e258.jpg for which An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e259.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e260.jpg be the largest index An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e261.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e262.jpg; it follows that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e263.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e264.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e265.jpg [resp. An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e266.jpg] consist of those secondary structures An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e267.jpg, such that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e268.jpg [resp. An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e269.jpg]; in other words

equation image
equation image

Theorem 1: For any given RNA sequence An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e272.jpg and normalized SHAPE data An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e273.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e274.jpg.

Proof:

equation image
equation image
equation image

To justify the inequality, note that for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e278.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e279.jpg, hence for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e280.jpg, we have An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e281.jpg. On the other hand, for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e282.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e283.jpg, hence for An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e284.jpg, we also have An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e285.jpg. Finally, the last line follows from the fact that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e286.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e287.jpg are both probability distributions, hence An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e288.jpg. This completes the proof that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e289.jpg.

The above theorem can be generalized; however, we first require some notation. The weighted partition function An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e290.jpg, weighted Boltzmann probability An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e291.jpg, and weighted expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e292.jpg were respectively defined in Equations (11),(13), and (18). When we wish to make the weighting parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e293.jpg explicit, we instead write An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e294.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e295.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e296.jpg. The following theorem shows that as the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e297.jpg increases, the expected distance to normalized shape data decreases:

Theorem 2: For any given RNA sequence An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e298.jpg, normalized SHAPE data An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e299.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e300.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e301.jpg; moreover, strict inequalities hold as well.

The proof of the theorem can be found in File S1.
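
The monotonicity stated in Theorems 1 and 2 can be checked numerically on a toy ensemble. The sketch below assumes that the shape weight of a structure has the exponential form exp(-alpha * d), where d is its distance to the normalized shape data and alpha is the weighting parameter; this form, the function name, and the toy energies and distances are assumptions made for illustration. Setting alpha to 0 gives the unweighted Boltzmann ensemble.

```python
import math

RT = 0.0019872 * 310.15  # kcal/mol at 37 degrees C


def expected_distance(energies, distances, alpha):
    """Expected distance to the shape data in the reweighted ensemble.

    energies[k]  : free energy of structure k (kcal/mol)
    distances[k] : distance of structure k to the normalized shape data
    alpha        : weighting parameter; alpha = 0 is the Boltzmann ensemble
    """
    weights = [math.exp(-e / RT) * math.exp(-alpha * d)
               for e, d in zip(energies, distances)]
    z = sum(weights)  # (weighted) partition function of the toy ensemble
    return sum(w * d for w, d in zip(weights, distances)) / z


# Hypothetical four-structure ensemble.
toy_energies = [-3.0, -2.5, -1.0, 0.0]
toy_distances = [4.0, 1.5, 2.0, 6.0]
for alpha in (0.0, 0.5, 1.0, 2.0):
    print(alpha, expected_distance(toy_energies, toy_distances, alpha))
```

Under this weight form, the derivative of the expected distance with respect to alpha equals minus the variance of the distance in the reweighted ensemble, which is why the printed values decrease as alpha grows whenever the distances are not all equal.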

Quadratic time computation of expected distance from shape data

Given RNAsc parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e302.jpg, recall that we defined the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e303.jpg-expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e304.jpg between An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e305.jpg, obtained by normalizing shape data, and the ensemble of low energy structures by

equation image
(19)

In the main text, we wrote An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e307.jpg, instead of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e308.jpg when An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e309.jpg.

In trying to compute An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e310.jpg by definition, we seemingly require a sum over exponentially many secondary structures, or at least an approximation of this sum over a representative sample of structures drawn from the low energy ensemble. This is not necessary. Here, we show how to compute An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e311.jpg from the base pairing probabilities An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e312.jpg, thus leading to a quadratic time algorithm.

By definition,

equation image

where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e314.jpg denotes the indicator function. Now for any fixed An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e315.jpg,

equation image

is equal to

equation image
(20)

Since An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e318.jpg, it follows that Equation (20) is equal to

equation image
(21)

It follows that

equation image

The values An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e321.jpg are computed in quadratic time from McCaskill's algorithm, and subsequently stored in an array. It follows that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e322.jpg can be computed in quadratic time.
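
The argument above rests on linearity of expectation, and can be sketched in code. The sketch assumes that the distance of a structure to the data is the sum over positions of the absolute difference between the data value (probability of being unpaired, in [0, 1]) and an indicator that equals 1 when the position is unpaired; this encoding and all function names are assumptions made for illustration. The brute-force function is included only to check the shortcut on a small, explicitly listed ensemble.

```python
def unpaired_probabilities(bpp):
    """q[i] = 1 - sum_j bpp[i][j]; costs O(n^2) for an n x n matrix."""
    n = len(bpp)
    return [1.0 - sum(bpp[i][j] for j in range(n) if j != i) for i in range(n)]


def expected_distance_from_bpp(bpp, data):
    """Expected distance computed position by position.

    With probability q[i] position i is unpaired and contributes 1 - data[i];
    otherwise it is paired and contributes data[i].
    """
    q = unpaired_probabilities(bpp)
    return sum(qi * (1.0 - pi) + (1.0 - qi) * pi for qi, pi in zip(q, data))


def bpp_from_ensemble(structures, probs, n):
    """Base pairing probabilities of an explicitly enumerated ensemble."""
    bpp = [[0.0] * n for _ in range(n)]
    for pairs, prob in zip(structures, probs):
        for i, j in pairs:
            bpp[i][j] += prob
            bpp[j][i] += prob
    return bpp


def expected_distance_brute_force(structures, probs, data):
    """Direct sum over all structures, for checking only."""
    total = 0.0
    for pairs, prob in zip(structures, probs):
        paired = {x for pair in pairs for x in pair}
        d = sum(abs(p - (0.0 if i in paired else 1.0))
                for i, p in enumerate(data))
        total += prob * d
    return total
```

Once the pairing probabilities are available, the shortcut never enumerates structures: the only quadratic step is summing each row of the probability matrix.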

Since RNAstructure of Deigan et al. [15] takes unnormalized shape data in the range from An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e323.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e324.jpg, we define the expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e325.jpg between unnormalized shape data and structure An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e326.jpg to be

equation image
(22)

where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e328.jpg denotes the unnormalized shape data at position An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e329.jpg. The expected distance An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e330.jpg between unnormalized shape data and the ensemble of low energy structures computed by RNAstructure with incorporated shape data is defined by

equation image
(23)

Scrutiny of the proof just given yields an efficient computation of

equation image
(24)

Since the approach in [15] only considers stacked base pairs, it seems very likely that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e333.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e334.jpg denotes the expected distance from shape data for the Boltzmann ensemble of low energy structures after the incorporation of the shape pseudo energy terms as in [15]. Indeed, the expected distance we obtain between unnormalized input shape data An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e335.jpg and the computed probabilities An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e336.jpg demonstrates this fact (see Table 1).

Results

In this section we present the benchmarking results for our algorithm RNAsc, a novel algorithm that recalibrates probing data as probabilities of nucleotides being unpaired and integrates this information as ‘soft constraints’ into the computation of minimum free energy secondary structure (see Methods). Furthermore, we present a direct comparison of in-line probing data and shape data for yeast asp-tRNA.

Analysis of shape and in-line probing for structure prediction

In order to directly characterize how well shape data reflects RNA secondary structure, we compared normalized shape data with base pairing status, as determined from crystallographic or NMR structures. We define shape distance to equal the difference between normalized shape reactivity, scaled from 0 to 1 (see section Normalization of shape), and binary base pairing status, with 0 for paired, 1 for unpaired, as derived from NMR or crystal structure. Using shape data for S. cerevisiae aspartyl-tRNA [25], HCV IRES [15], bI3 group I intron P546 [33], E. coli phenylalanine-tRNA [26], E. coli 5S RNA [26], and Fusobacterium nucleatum glycine riboswitch [26], we computed shape distance at each nucleotide. We observed that at many positions the shape distance has an absolute value greater than 0.5, thus indicating a significant difference between shape reactivity and the actual secondary structure. We refer to these positions as discrepancies. Over the set of RNAs we examined, between An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e339.jpg of the total data corresponded to such discrepancies (Fig. 3 and File S1). Many factors can account for these discrepancies, including differences between the crystal structure and the ensemble of structures in solution, potential tertiary contacts, and differential reactivity to the chemical agent.
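
The discrepancy count described above can be sketched as follows, assuming the reactivities have already been normalized to the range 0 to 1 and the reference status is encoded as 0 for paired and 1 for unpaired. The function name and the example values are hypothetical.

```python
def shape_discrepancies(normalized_reactivity, unpaired_status, threshold=0.5):
    """Return per-position shape distances and the indices of discrepancies.

    normalized_reactivity : values in [0, 1], one per nucleotide
    unpaired_status       : 0 if the nucleotide is paired in the reference
                            structure, 1 if it is unpaired
    threshold             : a position is a discrepancy when the absolute
                            shape distance exceeds this value
    """
    distances = [r - u for r, u in zip(normalized_reactivity, unpaired_status)]
    flagged = [i for i, d in enumerate(distances) if abs(d) > threshold]
    return distances, flagged


# Hypothetical four-nucleotide example.
reactivity = [0.9, 0.1, 0.7, 0.2]
status = [1, 0, 0, 1]
distances, flagged = shape_discrepancies(reactivity, status)
fraction = len(flagged) / len(reactivity)  # share of positions that disagree
```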

Figure 3
Shape discrepancies.

To assess whether an alternative experimental method might yield data that more accurately reflects the secondary structure, we performed in-line probing on the S. cerevisiae aspartyl-tRNA, for which shape data is available [25]. Like shape, in-line probing is a measure of backbone flexibility, where nucleotides in loops and other unpaired regions are generally more reactive than those that are base-paired [34]. In-line probing takes advantage of the spontaneous transesterification reactions responsible for RNA degradation that occur only when the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e340.jpgO from one nucleotide and the An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e341.jpgO of the next align in a 180 degree conformation around the phosphate. This conformation does not occur in the A-form helix, thus protecting linkages within the helix from cleavage. In-line probing and shape are thus likely to yield similar, but not equivalent data [35].

Our analysis indicates that in-line probing and shape reactivity profiles are quite distinct from one another. See Fig. 4 for a comparison of shape and in-line probing profiles and File S1 for shape reactivity profiles of other RNA molecules.

Figure 4
Comparison of In-line probing and shape.

The signal from in-line probing is significantly more diffuse than that from shape, and the error rate, as calculated above for shape, is significantly higher (An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e344.jpg vs. An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e345.jpg). Thus shape is a better reflection of secondary structure than in-line probing, at least in the case of asp-tRNA.

Integrating shape and in-line probing data into our new algorithm RNAsc also shows that shape has an edge over in-line probing. The structures predicted by RNAsc for yeast asp-tRNA using in-line probing and shape data are both identical to the crystal structure. However, one measure of the robustness of the data in the context of our secondary structure prediction algorithm RNAsc is the range of the scaling parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e346.jpg over which the correct structure can be recovered. Recall that An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e347.jpg is a weight parameter (see section Boltzmann Weights for details). We conducted a search for parameter An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e348.jpg for yeast asp-tRNA, using both in-line probing data and shape data. We found that when using in-line probing data, RNAsc produced the target structure for asp-tRNA only for a very narrow range of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e349.jpg, while when using shape data, this range was much larger (see Fig. 5). See Fig. 6 for a heat map of in-line vs. shape reactivity for asp-tRNA.

Figure 5
Optimal parameter value.
Figure 6
Heat maps of in-line probing and shape.

In a second analysis, we compared the pointwise entropy at each nucleotide using no data, shape data, and in-line probing data (see Fig. 7). We observe that shape data decreases the average entropy more than in-line probing data. However, we also observe that there are positions where the in-line probing decreases the entropy more than shape, suggesting that combinations of different experimental approaches may be able to yield additional information.

Figure 7
Pointwise entropies.

Validation of RNAsc

Using shape data from the Weeks Lab, we tested RNAsc on aspartyl-tRNA from S. cerevisiae, domain II of the hepatitis C virus internal ribosomal entry site (HCV IRES), and the P546 domain of the bI3 group I intron, from E. coli. Additionally, using shape data from the Das Lab, we tested RNAsc on E. coli phenylalanine tRNA (phe-tRNA), E. coli 5S ribosomal RNA (5S rRNA), and the glycine riboswitch from F. nucleatum with PDB code 3P49. As ‘gold standard’ structures, we used the NMR structure for P546 and X-ray structures for the remaining RNAs. The parameter used for RNAsc is An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e358.jpg, determined by search (see Fig. 5) to optimize sensitivity (proportion of base pairs in the reference structure that are correctly predicted) and positive predictive value (proportion of predicted base pairs that are in the reference structure). Slippage of An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e359.jpg [15], [36] is not allowed, contrary to benchmarking results of some authors. Here, slippage [36] means that if base pair An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e360.jpg is in the true structure, then the base pair An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e361.jpg is counted as “correctly” predicted, if one of the base pairs An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e362.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e363.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e364.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0045160.e365.jpg appears in the predicted structure – we do not allow slippage in the results of this paper.
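
For concreteness, sensitivity and positive predictive value with and without slippage can be sketched as below. The set of shifted pairs accepted under slippage, namely a shift of one position on either side of the pair, is an assumption about the convention of [36], and the function names are hypothetical. The results reported in this paper correspond to `allow_slippage=False`.

```python
def _candidates(i, j, allow_slippage):
    """Pairs that count as a match for (i, j)."""
    if not allow_slippage:
        return [(i, j)]
    # Assumed slippage convention: one partner may be shifted by one position.
    return [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]


def sensitivity_ppv(predicted, reference, allow_slippage=False):
    """Return (sensitivity, positive predictive value) for two pair lists."""
    pred, ref = set(predicted), set(reference)
    # Reference pairs that are recovered by the prediction.
    found = sum(any(c in pred for c in _candidates(i, j, allow_slippage))
                for i, j in ref)
    # Predicted pairs that are supported by the reference.
    correct = sum(any(c in ref for c in _candidates(i, j, allow_slippage))
                  for i, j in pred)
    sensitivity = found / len(ref) if ref else 0.0
    ppv = correct / len(pred) if pred else 0.0
    return sensitivity, ppv
```

Allowing slippage can only increase both measures, which is why benchmarks that allow it are not directly comparable with those that do not.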

Table 1 presents a comparison of RNAsc with RNAstructure, including a comparison of structural variation in the ensemble of low energy structures. This variation is computed by pointwise entropy and Morgan-Higgs structural diversity (see Methods). The table shows that the low energy ensemble, as computed by RNAsc with integration of shape data, has intermediate variation between that computed by RNAstructure with and without shape data. The fact that RNAstructure with incorporated shape data computes an ensemble of structures with less variation appears to be expected, given the parameters used in the algorithm of Deigan et al. [15].

As explained in Deigan et al. [15], RNAstructure incorporates shape data by including a pseudo free energy term

$$\Delta G_{\mathrm{SHAPE}}(i) = m \ln\bigl(\text{SHAPE reactivity}(i) + 1\bigr) + b$$
(25)

for a nucleotide position $i$. In the source code of RNAstructure, it is clear that the pseudo free energy term $\Delta G_{\mathrm{SHAPE}}(i)$ is applied only for positions $i$ involved in a stacked base pair. The optimal values for slope $m$ and $y$-intercept $b$ are obtained by grid search when maximizing structure prediction accuracy on certain known structures. Optimal slope and intercept values reported in [15] are $m = 2.6$ and $b = -0.8$ kcal/mol.
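
As a small worked illustration, the pseudo free energy term of Deigan et al. [15] can be evaluated as below. The sketch assumes the published form, slope times the natural logarithm of (reactivity + 1) plus intercept, with default slope 2.6 kcal/mol and intercept -0.8 kcal/mol; the function name is hypothetical and this is not the RNAstructure source code.

```python
import math


def shape_pseudo_energy(reactivity, m=2.6, b=-0.8):
    """Pseudo free energy (kcal/mol) for one nucleotide in a stacked pair.

    reactivity : unnormalized shape reactivity; must be greater than -1
    m, b       : slope and intercept, assumed defaults from Deigan et al.
    """
    return m * math.log(reactivity + 1.0) + b


# An unreactive nucleotide receives a bonus, a highly reactive one a penalty.
bonus = shape_pseudo_energy(0.0)
penalty = shape_pseudo_energy(2.0)
```

Because the term grows logarithmically with reactivity, the sign of the contribution changes at a moderate reactivity value, and highly reactive positions in stacked pairs are penalized.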

We now show that the smaller structural variation in the RNAstructure ensemble appears to be an artifact of the magnitude of the parameters $m, b$. Consider the two most extreme cases: (1) position $i$ in a given structure is base-paired, but shape reactivity is a maximum; (2) position $i$ in a given structure is not paired, but shape reactivity is a minimum.

Suppose that position $i$ is in a base-stacked region but the shape reactivity at position $i$ is [formula image e382], a maximum, though there are sometimes shape reactivities larger than [formula image e383]. With the default parameters for $m, b$, the pseudo free energy contribution of RNAstructure is [formula image e385], an energetic penalty. This penalty is quite large, given that the largest (in absolute value) free energy contribution for base stacking is [formula image e386] kcal/mol [37]. Under the same assumptions, RNAsc would have a pseudo free energy of [formula image e387], also an energetic penalty, yet much smaller than that of RNAstructure.

Suppose now that position $i$ is in a loop region but the shape reactivity at position $i$ is [formula image e390], the least possible value. Using the default parameters $m = 2.6$, $b = -0.8$ kcal/mol, the pseudo free energy contribution of RNAstructure, if applied in this case, would then be [formula image e392]. This value, paradoxically, would be an energetic bonus, although the predicted structure disagrees with shape data! It is presumably for this reason that Deigan et al. do not apply any pseudo free energy term to nucleotide positions $i$ located in a loop region. In contrast, under the same assumptions, RNAsc would have a pseudo free energy of [formula image e394], again a penalty; moreover, the same penalty of [formula image e395] kcal/mol is applied in each of the cases (1) and (2) just discussed.
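As a worked illustration of the two extreme cases for RNAstructure, the arithmetic can be written out using the pseudo free energy form of Deigan et al. [15] with the default values m = 2.6 and b = -0.8 kcal/mol. The reactivity values 1 (maximum) and 0 (minimum) are assumptions made here for illustration, since the specific values are not legible in this copy of the text.

```latex
\begin{align*}
\Delta G_{\mathrm{SHAPE}}(i)\Big|_{\mathrm{SHAPE}(i)=1}
  &= 2.6\,\ln(1+1) - 0.8 \approx 1.00\ \text{kcal/mol} && \text{(penalty)},\\
\Delta G_{\mathrm{SHAPE}}(i)\Big|_{\mathrm{SHAPE}(i)=0}
  &= 2.6\,\ln(0+1) - 0.8 = -0.8\ \text{kcal/mol} && \text{(bonus)}.
\end{align*}
```

Under these assumed reactivities, the sign flips from penalty to bonus as the reactivity falls to zero, which is why applying the term to an unreactive loop position would reward a structure that disagrees with the data.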

These illustrative examples suggest that structural variability, as measured by pointwise entropy and structural diversity, is lower in the low energy ensemble calculated by RNAstructure than in the RNAsc low energy ensemble, due to the magnitude of the parameters [formula image e396] used in RNAsc.

Note that the average relative decrease in expected distance of the computed probabilities to shape data from RNAstructure to RNAsc is [formula image e397]. In fact, after the incorporation of shape data, the expected distance of the computed probabilities to shape data increases for RNAstructure and decreases for RNAsc in each case. Apart from the ‘self-consistent’ nature of our algorithm, not shared by RNAstructure, the demonstrable decrease in expected distance of the computed probabilities to shape data provided by our approach indicates that we account more fully for the shape data. It is worth mentioning that for higher values of [formula image e398], the predicted Boltzmann probabilities [formula image e399] can be made to agree very closely with the experimental values [formula image e400] (strong self-consistency). Fig. 8 shows a plot of the expected distance of the computed probabilities to shape data for increasing values of [formula image e401] – see Methods for a proof. Note, however, that since the experimental probabilities (or normalized shape values) are generally not in perfect agreement with the native structure, we took the closeness of the predicted structure to the native structure as a measure for choosing the parameter [formula image e402].
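The exact definition of the expected distance is given in Methods and is not reproduced in this section. One plausible reading, stated here purely as an assumption, treats the unpaired status of each nucleotide as a Bernoulli variable with the computed probability and sums the expected absolute deviation from the normalized shape value over all positions. The sketch below implements that reading only; the function name and both inputs are hypothetical.

```python
def expected_distance(p_unpaired, shape):
    """Sum over positions of E|X_i - q_i|, where X_i ~ Bernoulli(p_i).

    p_unpaired: computed probabilities that each nucleotide is unpaired.
    shape: normalized shape values q_i, assumed to lie in [0, 1].
    """
    if len(p_unpaired) != len(shape):
        raise ValueError("inputs must have equal length")
    # E|X - q| = p * (1 - q) + (1 - p) * q for q in [0, 1].
    return sum(p * (1.0 - q) + (1.0 - p) * q for p, q in zip(p_unpaired, shape))

# Made-up example: perfect agreement versus an uninformative ensemble.
perfect = expected_distance([1.0, 0.0], [1.0, 0.0])
uninformative = expected_distance([0.5, 0.5], [1.0, 0.0])
```

Under this reading, a smaller value means the Boltzmann ensemble agrees more closely with the probing data, which matches the direction of the comparison made in the text.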

Figure 8
Expected distance of predicted probabilities with normalized shape data.

We believe RNAsc may be helpful long-term in elucidating the nature of discrepancies between shape and the native structure. As in any experimental protocol, there is a Gaussian error term; however, our data (not shown) indicate that shape discrepancy is positively correlated with pointwise entropy. Indeed, it seems plausible that a region of the RNA molecule which fluctuates due to thermal motion, thus having higher pointwise entropy, might entail a more variable accessibility for the chemical probe NMIA, thus causing a greater shape discrepancy with the X-ray structure. The program RNAsc allows the user to determine such regions of high pointwise entropy, and to see the structural variability in that region by sampling. It may be possible to confirm or refute our hypothesis concerning the non-Gaussian nature of shape discrepancy (“error”) by performing additional shape probing experiments at lower temperatures. RNAsc could therefore prove to be a valuable tool in this line of research.

Discussion

Widespread accessibility of quantitative RNA structural mapping techniques and medium- to high-throughput quantification of the data have motivated the development of computational tools to predict structures from such information. The integration of experimental data as “constraints” in the thermodynamic algorithm when computing the minimum free energy (MFE) structure can significantly improve the accuracy of RNA structure prediction. However, such methods are also dependent on the quality of the data used for the constraints [26]. It is worth mentioning that the errors in our algorithm RNAsc are directly related to the errors in the experimental data. Fig. 9 shows the shape distance to the native structure at the nucleotides where the secondary structure is predicted incorrectly for the glycine riboswitch. As can be seen, the shape distances to the native structure are very large for [formula image e419] out of the [formula image e420] incorrectly predicted positions. Thus, the prediction errors appear to stem from the quality of the input data rather than from limitations of the algorithm.

Figure 9
Errors in the prediction of the secondary structure of glycine riboswitch by RNAsc.

Two recent approaches towards overcoming this error include the iterative ‘sample and select’ approach of Quarrier et al. [38] and the ‘mutate and map’ strategy of Kladwang et al. [30]. The ‘sample and select’ strategy involves multiple mapping, followed by a simple filtering step, which removes the suboptimal structures (sampled from the low energy ensemble using the Sfold software [39]) that are incompatible with mapping data. In contrast, the ‘mutate and map’ strategy involves high-throughput structural probing of all single-nucleotide mutants, resulting in 2D shape data, followed by a computation of the minimum free energy structure, in which pseudo-energy base stacking terms have been added that correspond to Z-scores from 2D shape data. Although high-throughput ‘mutate and map’ strategies [30], using either shape-CE (capillary electrophoresis) or shape-Seq [40], provide very high secondary structure prediction accuracy, such methods also represent a significant increase in both experimental manipulation and cost that is often not warranted for more specific studies. Especially in such cases, we believe that our method, RNAsc, may be the tool of choice. On the other hand, the ‘mutate and map’ strategy can be normalized in such a way as to obtain base pairing probabilities. Since shape experiments can potentially probe tertiary interactions (as mentioned in the previous section), we could obtain probabilities not only for secondary interactions and canonical base pairs, but also for tertiary and long-range interactions as well as non-canonical base pairs. These probabilities can later be used as input to algorithms such as ProbKnot [41] or even to a Maximum Weight Matching algorithm [42] to predict pseudoknotted structures and non-canonical base pairs. We are currently pursuing this line of research.
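The filtering step of a ‘sample and select’ style procedure can be sketched as follows. This is not the implementation of Quarrier et al. [38], whose compatibility criterion is not described here. It is a minimal illustration that assumes a structure is incompatible when a base-paired position has a normalized reactivity above a chosen cutoff. The cutoff `high`, the tolerance `max_conflicts`, and the function names are all hypothetical.

```python
def paired_positions(db):
    """Indices that are base-paired in a dot-bracket string."""
    return {i for i, ch in enumerate(db) if ch in "()"}

def select_compatible(samples, reactivity, high=0.7, max_conflicts=0):
    """Keep sampled structures with at most `max_conflicts` paired positions
    whose reactivity exceeds `high` (an assumed incompatibility rule)."""
    kept = []
    for db in samples:
        conflicts = sum(1 for i in paired_positions(db) if reactivity[i] > high)
        if conflicts <= max_conflicts:
            kept.append(db)
    return kept

# Made-up sampled structures and reactivities for a 6-nucleotide toy RNA.
samples = ["((..))", "(....)", "......"]
reactivity = [0.1, 0.9, 0.8, 0.8, 0.9, 0.1]
compatible = select_compatible(samples, reactivity)
```

In this toy input the first structure pairs two highly reactive positions and is discarded, while the other two survive the filter.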

Supporting Information

File S1

Supplementary information.

(PDF)

Acknowledgments

We would like to thank D.H. Mathews for discussions and for making available the source code of RNAstructure [43], including the extension which incorporates base stacking pseudo-energies for shape data [15]. Thanks as well to R. Das for pointing us to the Stanford RNA Mapping Database http://rmdb.stanford.edu/ and for a preprint of his paper on the ‘mutate and map’ strategy. We would like to thank the anonymous referees for helpful remarks.

Funding Statement

No current external funding sources for this study.

References

1. Washietl S (2010) Sequence and structure analysis of noncoding RNAs. Methods in molecular biology (Clifton, NJ) 609: 285–306.
2. Garey M, Johnson D (1990) Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman & Co. 338 p.
3. Lyngso RB, Pedersen CN (2000) RNA pseudoknot prediction in energy-based models. J Comput Biol 7: 409–427.
4. Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9: 133–148.
5. Tinoco I Jr, Bustamante C (1999) How RNA folds. Journal of Molecular Biology 293: 271–281.
6. Banerjee A, Jaeger J, Turner D (1993) Thermal unfolding of a group I ribozyme: The low-temperature transition is primarily disruption of tertiary structure. Biochemistry 32: 153–163.
7. Cho SS, Pincus DL, Thirumalai D (2009) Assembly mechanisms of RNA pseudoknots are determined by the stabilities of constituent secondary structures. Proc Natl Acad Sci USA 106: 17349–17354.
8. Bailor MH, Sun X, Al-Hashimi HM (2010) Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science 327: 202–206.
9. Wilkinson K, Merino E, Weeks K (2005) RNA SHAPE chemistry reveals nonhierarchical interactions dominate equilibrium structural transitions in tRNAAsp. J Am Chem Soc 127: 4659–4667.
10. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31: 3406–3415.
11. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, et al. (1994) Fast folding and comparison of RNA secondary structures. Monatsh Chem 125: 167–188.
12. Mathews D, Turner D, Zuker M (2000) Secondary structure prediction. In: Beaucage S, Bergstrom D, Glick G, Jones R, editors, Current Protocols in Nucleic Acid Chemistry, New York: John Wiley & Sons. pp. 11.2.1–11.2.10.
13. Xia T, SantaLucia J, Burkard M, Kierzek R, Schroeder S, et al. (1999) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37: 14719–35.
14. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, et al. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 101: 7287–7292.
15. Deigan KE, Li TW, Mathews DH, Weeks KM (2009) Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci USA 106: 97–102.
16. Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, et al. (2010) Genome-wide measurement of RNA secondary structure in yeast. Nature 467: 103–107.
17. Merino EJ, Wilkinson KA, Coughlan JL, Weeks KM (2005) RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE). J Am Chem Soc 127: 4223–4231.
18. Wilkinson K, Merino E, Weeks K (2006) Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat Protoc 1: 1610.
19. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N, Rein A, et al. (2008) High-throughput SHAPE analysis reveals structures in HIV-1 genomic RNA strongly conserved across distinct biological states. PLoS Biol 6: e96.
20. Ban N, Nissen P, Hansen J, Moore PB, Steitz TA (2000) The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289: 905–920.
21. Novikova IV, Hennelly SP, Sanbonmatsu KY (2012) Structural architecture of the human long non-coding RNA, steroid receptor RNA activator. Nucleic Acids Research.
22. Mandal M, Boese B, Barrick J, Winkler W, Breaker R (2003) Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell 113: 577–586.
23. Meyer M, Roth A, Chervin S, Garcia G, Breaker R (2008) Confirmation of a second natural preQ1 aptamer class in Streptococcaceae bacteria. RNA 14: 685–695.
24. Das R, Laederach A, Pearlman SM, Herschlag D, Altman RB (2005) SAFA: semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments. RNA 11: 344–354.
25. Wilkinson K, Merino E, Weeks K (2005) RNA SHAPE chemistry reveals nonhierarchical interactions dominate equilibrium structural transitions in tRNAAsp transcripts. Journal of the American Chemical Society 127: 4659–4667.
26. Kladwang W, Vanlang CC, Cordero P, Das R (2011) Understanding the Errors of SHAPE-Directed RNA Structure Modeling. Biochemistry 50: 8049–8056.
27. McCaskill J (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29: 1105–1119.
28. Mathews D, Sabina J, Zuker M, Turner D (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288: 911–940.
29. Forties RA, Bundschuh R (2010) Modeling the interplay of single-stranded binding proteins and nucleic acid secondary structure. Bioinformatics 26: 61–67.
30. Kladwang W, Cordero P, Das R (2011) A mutate-and-map strategy accurately infers the base pairs of a 35-nucleotide model RNA. RNA 17: 522–534.
31. Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244: 48–52.
32. Higgs PG (1996) Overlaps between RNA secondary structures. Physical Review Letters 76: 704–707.
33. Duncan C, Weeks K (2008) SHAPE analysis of long-range interactions reveals extensive and thermodynamically preferred misfolding in a fragile group I intron RNA. Biochemistry 47: 8504–8513.
34. Soukup G, Breaker R (1999) Relationship between internucleotide linkage geometry and the stability of RNA. RNA 5: 1308–1325.
35. Dann CE, Wakeman C, Sieling C, Baker S, Irnov I, et al. (2007) Structure and mechanism of a metal-sensing regulatory RNA. Cell 130: 878–892.
36. Mathews D (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics 15: 2246–2253.
37. Turner DH, Mathews DH (2010) NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res 38: D280–2.
38. Quarrier S, Martin J, Davis-Neulander L, Beauregard A, Laederach A (2010) Evaluation of the information content of RNA structure mapping data for secondary structure prediction. RNA 16: 1108–1117.
39. Ding Y, Chan CY, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32: W135–W141.
40. Lucks JB, Mortimer SA, Trapnell C, Luo S, Aviran S, et al. (2011) Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proceedings of the National Academy of Sciences of the United States of America 108: 11063–11068.
41. Bellaousov S, Mathews DH (2010) ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA (New York, NY) 16: 1870–1880.
42. Tabaska J, Cary R, Gabow H, Stormo G (1998) An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics 14: 691–699.
43. Reuter JS, Mathews DH (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11: 129.

Articles from PLoS ONE are provided here courtesy of Public Library of Science