PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2010 June 15; 26(12): i278–i286.
Published online 2010 June 1. doi:  10.1093/bioinformatics/btq218
PMCID: PMC2881402

Thermodynamics of RNA structures by Wang–Landau sampling

Feng Lou1 and Peter Clote1,2,3,*

Abstract

Motivation: Thermodynamics-based dynamic programming RNA secondary structure algorithms have been of immense importance in molecular biology, where applications range from the detection of novel selenoproteins using expressed sequence tag (EST) data, to the determination of microRNA genes and their targets. Dynamic programming algorithms have been developed to compute the minimum free energy secondary structure and partition function of a given RNA sequence, the minimum free-energy and partition function for the hybridization of two RNA molecules, etc. However, the applicability of dynamic programming methods depends on disallowing certain types of interactions (pseudoknots, zig-zags, etc.), as their inclusion renders structure prediction an nondeterministic polynomial time (NP)-complete problem. Nevertheless, such interactions have been observed in X-ray structures.

Results: A non-Boltzmannian Monte Carlo algorithm was designed by Wang and Landau to estimate the density of states for complex systems, such as the Ising model, that exhibit a phase transition. In this article, we apply the Wang-Landau (WL) method to compute the density of states for secondary structures of a given RNA sequence, and for hybridizations of two RNA sequences. Our method is shown to be much faster than existent software, such as RNAsubopt. From density of states, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. The advantage of the WL method is that by adding a function to evaluate the free energy of arbitary pseudoknotted structures and of arbitrary hybridizations, we can estimate thermodynamic parameters for situations known to be NP-complete. This extension to pseudoknots will be made in the sequel to this article; in contrast, the current article describes the WL algorithm applied to pseudoknot-free secondary structures and hybridizations.

Availability: The WL RNA hybridization web server is under construction at http://bioinformatics.bc.edu/clotelab/.

Contact: ude.cb@etolc

1 INTRODUCTION

RNA is an important biomolecule, now known to play both an information carrying role, as well as a catalytic role. Indeed, the genomic information of retroviruses, such as the hepatitis C and human immunodeficiency viruses, is encoded by RNA rather than DNA, while the peptidyl transferase reaction, arguably the most important enzymatic reaction responsible for life, is catalyzed not by a protein, but rather by RNA (Weinger et al., 2004). It has recently emerged that RNA plays a wide range of previously unsuspected roles in many biological processes, including retranslation of the genetic code [selenocysteine insertion (Böck et al., 1991), ribosomal frameshift (Bekaert et al., 2003)], transcriptional and translational gene regulation (Lim et al., 2003; Mandal et al., 2003), temperature-sensitive conformational switches (Chowdhury et al., 2003; Tucker and Breaker, 2005), chemical modification of specific nucleotides in the ribosome (Omer et al., 2000), regulation of alternative splicing (Cheah et al., 2007), etc.

A secondary structure for a given RNA nucleotide sequence a1,…, an is a set S of base pairs (i, j), such that ai, aj forms either a Watson–Crick or GU (wobble) base pair, and such that there are no base triples or pseudoknots in S.1 For example, the secondary structure of Y RNA2 with EMBL access code AAPY01489510/220-119 is displayed in Figure 1a and b, while Figure 1c and d depicts the pseudoknotted structure of the Gag/pro ribosomal frameshift site of mouse mammary tumor virus (Van Batenburg et al., 2001). In conventional dot-bracket notation, this latter structure is given as follows, where it should be noted that two kinds of bracket are needed due to the pseudoknot

equation image

Fig. 1.
(a and b) Pseudoknot-free secondary structure of Y RNA with EMBL access code AAPY01489510/220-119, depicted in (a) in Feynman circular form, and in panel (b) in classical form. (c and d) Pseudoknotted structure for the Gag/pro ribosomal frameshift site ...

It is computationally intractable to compute the minimum free-energy tertiary structure of RNA; indeed, determining the optimal pseudoknotted structure is nondeterministic polynomial time (NP)-complete Lyngso and Pedersen (2000). In contrast, by disallowing pseudoknots, secondary structure prediction is algorithmically tractable; there are dynamic programming algorithms to compute the minimum free-energy structure for a single RNA molecule, as well as for the hybridization of two or more RNA molecules. In particular, such methods can be loosely grouped into two types of algorithm—those that use (i) a stochastic context free grammar to compute a covariation model and (ii) free-energy parameters obtained from UV absorbance (optical melting) experiments, in order to determine the minimum free energy structure (i.e. thermodynamic-based algorithms). Examples of stochastic context-free grammars are the programs Infernal (Nawrocki et al., 2009) and Pfold (Knudsen and Hein et al., 2003). Examples of thermodynamics-based algorithms are the programs mfold (Zuker and Stiegler, 1981), UNAFOLD (Markham and Zuker, 2008), RNAfold (Hofacker et al., 1994), RNAstructure (Mathews et al., 2004). Thermodynamics-based algorithms for hybridization of two structures are given in UNAFOLD (Dimitrov and Zuker, 2004), RNAcofold (Bernhart et al., 2006; Mückstein et al., 2006), while the NUPACK software considers hybridization of three or more RNA molecules. (Dirks et al., 2007). Such thermodynamics-based algorithms are particularly important, since the tertiary structure of RNA is believed to be largely determined by secondary structure, which acts as a scaffold for tertiary contacts; see Banerjee et al. (1993) for experimental data supporting this view.3 Computing the minimum free-energy pseudoknotted structure for a given RNA sequence is NP-complete Lyngso and Pedersen (2000) for the Turner nearest neighbor energy model.4 For that reason, pseudoknot structure prediction algorithms fall into three categories: (i) exponential time exact algorithms, (ii) dynamic programming algorithms that restrict pseudoknots to a particular class and (iii) heuristic methods. Examples of exact algorithms for pseudoknot structure prediction are the branch-and-bound algorithm of (Bon, 2009) and the method using tree-width decomposition of Zhao et al. (2008). Examples of algorithms that consider only pseudoknots of a particular class are found in the pioneering work of Rivas and Eddy (1999) and Lefebvre (1995), with subsequent refinements in Dirks and Pierce (2003); Reeder and Giegerich (2004) and Ren et al. (2005) Examples of heuristic approaches include Monte Carlo methods Metzler and Nebel (2008), genetic algorithms Abrahams et al. (1990) and a simple, yet elegant algorithm called ProbKnot (D.H. Mathews, to appear) that appears to be the state-of-the art method according to recent benchmarking studies. Finally, it is beyond the scope of this article to provide additional background on algorithms for RNA structural alignment, motif detection or tertiary structure prediction.

As will be shown later, by Wang-Landau (WL) Monte Carlo methods, we can obtain essentially the same results as by dynamic programming computation of the partition function from UNAFOLD and RNAcofold; however, the advantage of the WL approach is that by extending the energy evaluation function for a given structure or hybridization, we can estimate the partition function for arbitrary pseudoknotted structures, known to be an NP-complete problem.

Before proceeding, we formally define a secondary structure as follows. Given an RNA sequence s = a1,…, an, a secondary structure S on s is defined to be a set of ordered pairs corresponding to base pair positions, which satisfies the following requirements.

  1. WatsonCrick or GU wobble pairs: if (i, j) belongs to S, then pair (ai, aj) must be one of the following canonical base pairs: (A, U), (U, A), (G, C), (C, G), (G, U) and (U, G).
  2. Threshold requirement: if (i, j) belongs to S, then ji > θ.
  3. Non-existence of pseudoknots: if (i, j) and (k, [ell]) belong to S, then it is not the case that i < k < j < [ell].
  4. No base triples: if (i, j) and (i, k) belong to S, then j = k; if (i, j) and (k, j) belong to S, then i = k.

For steric reasons, following convention, the threshold θ, or minimum number of unpaired bases in a hairpin loop, is taken to be three. For any additional background on RNA and dynamic programming computation of secondary structures, see Clote and Backofen (2000) and the recent review Eddy (2004).

2 APPROACH

The non-Boltzmannian WL Monte Carlo algorithm was developed by Wang and Landau, 2001a, b) to estimate the density of states and partition function for complex systems, such as the Ising model, that exhibit a phase transition. While the Metropolis-Hastings Monte Carlo algorithm samples low energy states, the WL algorithm is designed to visit states uniformly across all energies in a discrete energy landscape. Indeed, for the Metropolis–Hastings algorithm, the expected frequency, or stationary probability, pmc*(x) of visiting the state x, whose energy is E, is given by the uniform probability An external file that holds a picture, illustration, etc.
Object name is btq218i1.jpg times the Boltzmann probability An external file that holds a picture, illustration, etc.
Object name is btq218i2.jpg, where g(E) is the number of states having energy E and the partition function Z = ∑zeE(z)/RT; in contrast, for the WL algorithm, the expected frequency or stationary probability, of visiting state x is An external file that holds a picture, illustration, etc.
Object name is btq218i3.jpg, where x2130 is the total number of distinct energies E (in the discrete case), or of energy bins (in the continuous case). It follows that non-Boltzmannian sampling strategies, such as that devised by Wang and Landau (2001a, b), Kou and Wong Kou et al. (2006a), etc. are potentially useful in biopolymer folding, where one searches for a global energy minimum in a landscape having many local energy minima. Indeed in Chen and Xu (2006), Chen and Xu applied the WL algorithm for the structure prediction of helical transmembrane proteins, while the equi-energy sampling method of Kou and Wong Kou et al. (2006a), related to Monte Carlo with replica exchange, has been applied to estimate the density of states for lattice protein folding under the hydrophobic–hydrophilic (HP) energy model Kou et al. (2006b), as well as in protein structure prediction by the fragment assembly Zhang et al. (2009).

In this article, we apply the WL algorithm to compute the density of states and partition function for RNA secondary structure as well as for the hybridization of two RNA sequences. We begin by validating and benchmarking the WL method against the exhaustive method RNAsubopt Wuchty et al. (1999), that enumerates all secondary structures of a given RNA sequence. Next, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. We describe as well how to compute the partition function Z(T) over all temperatures from 0°C to 100°C by performing two WL computations, followed by convolution calculations. Although the computation of the partition function over all secondary structures and over all pseudoknot-free hybridizations can be done using the existent software RNAfold (Hofacker, 2003), respectively, RNAcofold (Bernhart et al., 2006), UNAFold (Markham and Zuker, 2008) and a recently published method of Chitsaz et al., the real advantage of our method is that by adding a function to evaluate arbitary pseudoknotted structures and arbitrary hybridizations, we can approximately compute the partition function, heat capacity, melting temperature, etc. for a context known to be NP-complete Lyngso and Pedersen (2000).

The density of states is defined to be the absolute frequency function for energy; i.e. density of states g(e) counts the number of states having energy e. In the context of RNA secondary structure, a state is a secondary structure for an arbitrary but fixed RNA sequence s. In Cupal et al. (1996), described the first efficient algorithm, running in O(m2n3) time, to compute the density of states for an RNA sequence of length n, where energy is discretized into m bins. The program of Cupal et al. (1996) is no longer available, since it has been superceded by the program RNAsubopt, developed by Wuchty et al. (1999), which enumerates all secondary structures, whose free energy is within a user-defined bound above the minimum free energy. Though not documented, the RNAsubopt program additionally admits the option -D, which, instead of outputting structures, outputs only the number of secondary structures in each energy bin above the minimum free energy (bin size 0.1 kcal/mol).

3 METHODS

Monte Carlo algorithms have been implemented by a number of groups, to study RNA kinetics of folding. In particular, KinFold, developed by Flamm et al. (2000), computes the mean first passage time (MFPT) of folding, by using a variant of the Gillespie algorithm in an event-driven simulation with a choice of Metropolis–Hastings and Kawasaki dynamics. In Isambert and Siggia (2000) and Xayaphoummine et al. (2005) a similar time-driven Monte Carlo simulation program, KineFold, is described to compute kinetically determine pseudoknotted structure for a given RNA sequence. Danilova et al. (2006) describe the RNAkinetics web server used to study the kinetics of the folding transitions of a growing RNA molecule, as in the case of transcriptional folding.

We now begin by providing background definitions and describing the WL algorithm.

3.1 WL

The WL algorithm, (Wang and Landau, 2001a, b) was designed in order to compute the density of states and partition function, neither of which can be computed directly by classical Monte Carlo methods, such as the Metropolis–Hastings algorithm, simulated annealing, replica exchange, etc.

Recall the definition of Markov chain. Let Q = {1,…, n} be a finite set of states, let π = (p1,…, pn) be the distribution for initial state, and let P = (pi,j) be a matrix of transition probabilities, satisfying ∑jpi,j = 1 for all i. A (first-order, time-homogeneous) Markov chain M = (Q, π, P) is a stochastic process, whose state qt at time t is a random variable determined by

equation image

Define pi(t) = Pr[qt = i] and pi,j(t) = Pr[qt = j | q0 = i]. Clearly, the (i, j)-th entry of the t-th power Pt of P equals pi,j(t); moreover, by time-homogeneity it follows that pi,j(t) = Pr[qt0+t = j|qt0 = i], for all t0. The stationary probability of state i is defined by limtpi(t) = p*i, provided the limit exists. It is a classical result that every finite, aperiodic, irreducible Markov chain has an equilibrium distribution of stationary probabilities; see the text of Clote and Backofen (2000) for a new, self-contained proof of this result. A Markov chain with state set Q and stationary probabilities p*1,…, p*n is reversible, if for all i, j [set membership] Q, p*ipi,j = p*jpj,i.

Figure 2 presents pseudocode for the classical Metropolis–Hastings Monte Carlo algorithm with simulated annealing (Kirkpatrick et al., 1983; Metropolis et al., 1953), which implements a random walk on the Markov chain whose transition probabilities pi,j of moving from state xi to xj is given by

equation image
(1)

where [mathematical script N](xi) is the set of immediate neighbors of state xi and [mathematical script N](xj) the set of immediate neighbors of state xj; i.e. [mathematical script N](xi) is the set of states that can be reached by a single move from state xi. It can be proved that the stationary probabilities for this Markov chain are given by the Boltzmann probabilities An external file that holds a picture, illustration, etc.
Object name is btq218i4.jpg, as shown in Clote and Backofen (2000).

Fig. 2.
Pseudocode for Metropolis–Hastings algorithm with simulated annealing (Kirkpatrick et al., 1983).

In contrast, Figure 3 presents pseudocode for the WL algorithm, which implements a random walk on the Markov chain whose transition probabilities pi,j of moving from state xi to xj are given by

equation image
(2)

In this case, the stationary probability of state xi is are given by An external file that holds a picture, illustration, etc.
Object name is btq218i5.jpg.

Fig. 3.
Pseudocode for WL algorithm, as applied to RNA secondary structure density of states computation. In line 8, [mathematical script N](S) denotes the collection of immediate neighbors of structure S; i.e. those obtained by adding or removing a single base pair. In line ...

The mathematical–justification for applying the Metropolis-Hastings Monte Carlo method (Metropolis et al., 1953) to determine the minimum energy conformation of a biopolymer (Bradley et al., 2005; Das and Baker, 2007; Ortiz et al., 1998) depends on two facts: (i) every finite, irreducible, aperiodic Markov chain has a stationary probability distribution and (ii) if the Markov chain is reversible, a situation called detailed balance by the physics community, then the stationary distribution of the Markov chain corresponding to the Metropolis–Hastings algorithm is the Boltzmann distribution, defined by An external file that holds a picture, illustration, etc.
Object name is btq218i6.jpg, where E(x) is the energy of state (i.e. conformation) x, R is the universal gas constant 1.986 cal/mol, T is absolute temperature, and the partition function Z is defined by ∑x exp(−E(x)/RT, where the sum is taken over all states x in the Markov chain. As temperature T approaches zero, the Boltzmann probability of the minimum energy state approaches 1, in the case of a unique minimum energy state, or more generally 1/m, in the case of m distinct minimum energy states. See Clote and Backofen (2000) for details.

In contrast to the Metropolis–Hastings algorithm, which performs a random walk on the Markov chain of states (secondary structures), the WL algorithm performs a random walk on the energy space of the Markov chain of states (secondary structures), where the stationary probability of visiting energy ei is proportional to An external file that holds a picture, illustration, etc.
Object name is btq218i7.jpg, then the histogram of energies encountered in the random walk will be flat.

In this article, we consider the Markov chain, whose states are the secondary structures of a given RNA sequence, and for which permissible local moves correspond to the addition or removal of a single base pair (Flamm et al., 2000). Although detailed balance holds for the Metropolis–Hastings algorithm in Figure 2, it does not necessarily hold for the Metropolis algorithm, obtained by replacing line

An external file that holds a picture, illustration, etc.
Object name is btq218i15.jpg

by

An external file that holds a picture, illustration, etc.
Object name is btq218i16.jpg

Indeed for the case of RNA secondary structures, detailed balance does not hold in this situation, since if we define the stationary probability p*i for state xi to be the Boltzmann probability An external file that holds a picture, illustration, etc.
Object name is btq218i8.jpg, and the transition probabilities given by Equation (1), then it is not always the case that p*i · pi,j = p*j · pj,i. For instance, the empty structure S =………. on the 10-mer GGGGGCCCCC has 18 immediate neighbors, one of which is T = (……). The structure T has 11 immediate neighbors, one of which is the empty structure S. Letting xi = S and xj = T, we have E(xi) = 0 kcal/mol, E(xj) = 2.70 kcal/mol, ensemble free energy is −RTln(Z) = −3.96, hence Z = exp(3.96/RT) where T = 310°C so Z = 621.5 and we have stationary probabilities An external file that holds a picture, illustration, etc.
Object name is btq218i9.jpg, An external file that holds a picture, illustration, etc.
Object name is btq218i10.jpg, An external file that holds a picture, illustration, etc.
Object name is btq218i11.jpg and An external file that holds a picture, illustration, etc.
Object name is btq218i12.jpg. We compute that

equation image

Summarizing, in the Metropolis algorithm (with modified line 11), reversibility of a Markov chain depends on the permissible local moves, while in the Metropolis–Hastings algorithm (with line 11 as in Fig. 2), reversibility is always ensured. In the case at hand, if every secondary structure is an immediate neighbor of every secondary structure, then in the Metropolis algorithm, transition probabilities would be

equation image
(3)

where [mathematical script N] is the number of secondary structures. In this case, an easy computation shows that the Markov chain is reversible. Despite the non-reversible nature of the Markov chain corresponding to the Metropolis algorithm, whose states are the secondary structures of a given RNA sequence, and whose local moves consist of the addition or removal of a single base pair, it has been a standard practice to apply the Metropolis algorithm in this case (Danilova et al., 2006; Flamm et al., 2000; Isambert and Siggia, 2000; Xayaphoummine et al., 2005). For that reason, we do not hesitate to apply the WL algorithm for the study of RNA secondary structure formation.

Note that in Figure 3, the WL computes the relative density of states, defined by g(i) = N(ei)/N, where N(ei) is the number of states having energy ei and N is the total number of states. In the case of RNA secondary structures, it is simple to compute the total number of secondary structures by dynamic programming, given as follows. Given an RNA sequence of length n, let BPi,j = 1 if positions i,j can form a Watson–Crick or wobble pair, otherwise let BPi,j = 0. Let θ = 3 denote the minimum number of unpaired bases in a hairpin loop. Letting Ni,j denote the number of secondary structures on subsequence [i, j] of the given RNA sequence, we have that Ni,j = 0 if j < i + 3, and otherwise

equation image

It follows that the total number of secondary structures is then N1,n. From the relative density of states computed by WL algorithm, we compute the absolute density of states by

equation image

For fixed temperature T for which the WL computation was done, we can compute the partition function Z(T) = ∑S exp(−E(S)/RT) by

equation image
(4)

In their original article Wang and Landau (2001a, b) mentioned that in the case of the Ising model, Equation (4) allows one to compute the partition function at any desired temperature T from the density of states. Unfortunately, this is no longer the case for the Turner nearest neighbor model Xia et al., 1999 of RNA secondary structure, since the free energy parameters for stacked base pairs, hairpins, bulges, internal loops, etc. all depend on temperature. We can nevertheless proceed by computing the density of states for free energy at T = 37°C, and the density of states for enthalpy (assumed to be temperature independent), and then by convoluting these values, we obtain the density of states for free energy at any desired temperature.

3.2 Partition function for a single RNA

Figure 4a displays the relative density of states for the free energy of secondary structures of the 45 nt flavivirus capsid hairpin (cHP) with EMBL access code AB010982/1-45. Figure 4b displays the sum of squared differences between the density of states and the best fitting normal distribution, respectively, extreme value distribution. The cHP is a conserved RNA hairpin structure in the capsid-coding region of flavivirus genomes. Note that the relative density of states, or energy histogram, is approximately normal. In Clote et al. (2009) it is rigorously proved that the relative density of states is asymptotically normal; specifically, it is shown that the limit, as n approaches infinity, of the relative density of states for an RNA sequence of length n is normal, where for the purpose of mathematical analysis it is assumed that any base can pair with any other base (homopolymer model) and that the energy of a secondary structure S is −1 times the number of base pairs in S (Nussinov energy model; Nussinov and Jacobson, 1980).

Fig. 4.
(a) Density of states for free energy of secondary structures of the 45 nt flavivirus cHP with EMBL access code AB010982/1-45 and sequence AUGAACAACC AACGAAAAAG GACGGGAAAA CCGUCUAUCA AUAUG. Overlaid on the graph is the best fitting normal distribution ...

3.3 Partition function of hybridization

Following the approach in program RNAcofold of Bernhart et al. (2006), we can modify the WL program of Figure 3 to compute the density of states for all hybridizations of two RNA sequences, where both intermolecular and intramolecular base-pairing is allowed, provided that there are no pseudoknots.

In the case of the hybridization of twoRNA secondary structures, the first of length n and the second of length m, we can compute the total number of hybridizations as follows. Given an RNA sequence A = a1,…, an of length n and an RNA sequence B = b1,…, bm of length m, let HPi,j = 1 if positions ai, bj can hybridize, forming a Watson–Crick or wobble pair, otherwise let HPi,j = 0. For 1 ≤ i, jn, 1 ≤ k, [ell]m, let Hi,j;k,[ell] denote the number of hybridizations of the subsequence ai,…, aj with bk,…, b[ell]. From Equation (3), we can compute the number NAx,y, respectively, NBx,y of secondary structures on subsequence ax,…,ay of A, respectively, bx,…, by of B. If j < i or [ell] < k, then defined Hi,j;k,[ell] = 0; otherwise define Hi,j;k,[ell] by

equation image
(5)

It follows that the total number of pseudoknot-free hybridizations is then H1,n;1,m.5 The previous algorithm is clearly O(n4).

By considering the number of hybridizations to be the same as the number of secondary structures of a chimeric sequence, formed by concatenating A, B to form c1,…, cn+m = a1,…, an, b1,…, bm, we have an O(n3) algorithm, as follows. For 1 ≤ i, jn + m, if j < i or (1 ≤ i, jn, ji ≤ θ = 3), then Ni,j = 0, while if 1 ≤ in, n + 1 ≤ jn + m, then Ni,j = 1; otherwise Ni,j is equal to

equation image

It follows that the total number of hybridizations is then N1,n.

We now describe how to compute the melting temperature TM of hybridization.

  1. Compute number of structures for each of five species (temperature independent): S(A), S(B), S(AA), S(BB) and S(AB).
  2. For temperature T [set membership] {0°C,…, 100°C}, compute relative density of states f(A, T), f(B, T), f(AA, T), f(BB, T) and f(AB, T) for each species by WL.
  3. For temperature T [set membership] {0°C,…, 100°C}, compute partition functions Z(A, T), Z(B, T), Z(AA, T), Z(BB, T) and Z(AB, T) by
    equation image
    where absolute density of states g(E) is relative density times number of structures. For instance
    equation image
  4. Following Dimitrov and Zuker (2004), for temperature T [set membership] {0°C,…, 100°C}, compute ensemble free energy ΔG(A, T), ΔG(B, T), ΔG(AA, T), ΔG(BB, T) and ΔG(AB, T). This involves the following.
    1. Redundancy correction:
      equation image
    2. Symmetry correction:
      equation image
    3. Temperature-dependent chemical equilibrium constants:
      equation image
    4. Temperature-dependent concentration (number) of molecules A and B:
      equation image
      where NA0, NB0 are given and KA, KB, KAB are obtained from the previous step. Values NA and NB are gotten by using, for example, Newton's method for solving two nonlinear functions; due to issues of numerical instability, Markham uses binary search (p. 43 of Markham, 2006).
    5. Letting Z(A, B, AB, AA, BB) equal the following expression:
      equation image
      it follows that the total partition function Z satisfies
      equation image
      which can be approximated by the term Z(A, B, AB, AA, BB) where NA, NB, NAB, NAA, NBB obtained as previously explained. The chemical potential μX for each species X is the partial derivative An external file that holds a picture, illustration, etc.
Object name is btq218i13.jpg of ensemble free energy with respect to number of molecules of X, hence
      equation image
      so
      equation image
      Total free energy satisfies
      equation image
      which simplifies to
      equation image
    6. Normalize the ensemble free energy in terms of energy per mole of solute:
      equation image
  5. Determine heat capacity as a function of temperature by
    equation image
    by computing the second partial of a fitting parabola determined by 2m+1 evenly spaced points, using the approximation for An external file that holds a picture, illustration, etc.
Object name is btq218i14.jpg given by
    equation image
    In a post-processing step, smooth the heat capacity curve by computing a running average. The melting temperature TM(Cp) is computed by determining the temperature at which heat capacity achieves a maximum.

4 DISCUSSION

The Figure 5a displays the run time of the WL method, compared with that of RNAsubopt from the Vienna RNA package, while the Figure 5b of the same figure shows sample output from our WL program. Figure 5 clearly shows the advantage of WL over existent methods in computing the density of states for both single RNA molecules and for hybridization complexes of two RNA molecules. Figure 6a and b depicts the heat capacity computed by the WL method (Fig. 5a) and the program UNAFold (Fig. 5b). Melting temperature, which is usually defined as the temperature at which half of the molecules are single-stranded, while the other half are hybridized, is determined as that temperature where heat capacity achieves its maximum. The program UNAFold does not allow any intramolecular structure (base pairing between 2 nt of the same structure), a feature that our WL method permits, as does the RNAcofold program. While it is clear that additional work must be done to improve heat capacity computation with the WL method, the melting temperature TM computed by WL agrees reasonably well with that computed by O(n3) methods UNAFold, RNAcofold, and the recent O(n6) method of Chitsaz et al. (2009) each of which methods admits slightly different interactions.

Fig. 5.
(a) Comparison of execution times of WL and program RNAsubopt-D (Wuchty et al., 1999), in computing density of states. Since the program of Cupal et al. (1996) is no longer publicly available, and is superceded by RNAsubopt (private correspondence from. ...
Fig. 6.
Computation of heat capacity cP(T) for the toy sequence 5′-AGCGA-3′, hybridized to its reverse complement 3′-UCGCU-5′. (a) Graph generated by WL method described in this article. (b) Graph generated by the program UNAFold ...

We now describe how to approximately compute the partition function Z(T) over all secondary structures and over all pseudoknot-free hybridizations, simultaneously over all temperatures from 0°C to 100°C, by performing two WL computations, followed by a computation of the convolution of enthalpy relative frequency with free-energy relative frequency. Similar computations using existent methods require over 100 cubic time computations.

  • Compute the relative density of states ph for free energy using WL with temperature T = −273°C (absolute zero Kelvin). It follows that ph is the relative density of states for enthalpy, Due to the fundamental thermodynamic relation
    equation image
    (6)
    where T (K) is absolute temperature and ΔG, ΔH, ΔS, respectively, denote the change in free energy, enthalpy and entropy.
  • Compute the relative density of states pg for free energy using WL with temperature T = 37°C (310 K).
  • From Equation (6), we have that
    equation image
  • Given arbitrary absolute temperature T, compute the relative density of states for free energy at temperature T by the following pseudocode, representing a kind of convolution of pg with ph. An external file that holds a picture, illustration, etc.
Object name is btq218i17.jpg
  • Compute the absolute density of states g(z) = p(z) · N, where N is the total number of secondary structures, computed by Equation (3).

By this method, one can approximate the partition function Z(T) for all temperatures from 0°C to 100°C, by performing two WL sampling runs, respectively, at temperatures −373°C and 37°C, and then to repeatedly perform a fast convolution. The method just described, which involves two WL computations, together with convolution computations, has until now not worked well in practice, for certain technical reasons. This direction needs further exploration.

Another issue concerning any sampling method is the required time to obtain reasonably good estimates of the quantity in question. In the case of RNA kinetics, computations of MFPT to reach the minimum free-energy structure take inordinate amounts of time, when using Metropolis–Hastings Monte Carlo methods, which are time-driven simulations. For this reason, the program KinFold (Flamm et al., 2001) uses an event-driven simulation, where time is incremented by an exponentially distributed random variable. It may be possible to use similar ideas to increase efficiency of our WL program, which should further improve the accuracy in the computation of heat capacity. Finally, we intend to implement a new energy evaluation function, that allows arbitrary pseudoknots, zig-zags, etc. using energy parameters from the recent dissertation of Bon (Bon, 2009). This will allow us to estimate the partition function, ensemble free energy, heat capacity, melting temperature, etc. for a context known to be NP-complete.

5 CONCLUSION

In this article, we have implemented the WL algorithm to compute the relative density of states for RNA secondary structures and hybridizations. Separately computing the number of structures and hybridizations, we obtain the absolute density of states, which then yields the partition function, and thence, in the case of hybridization, the melting temperature. The WL method is much faster than existent software RNAsubopt in computing the density of states, but could not be benchmarked with the binning method of Cupal et al. (1996) which runs in O(m2n3) time, for length n sequence and m energy bins, since the latter software is no longer available, being superceded by RNAsubopt-D. In preliminary tests, we obtain roughly the same melting temperature for duplex RNA, as that computed by existent methods; however, the real advantage of the WL method is that there is no restriction on types of allowed interaction, unlike the situation with dynamic programming approaches that disallow pseudoknots, zig-zags, etc.

ACKNOWLEDGEMENTS

We would like to thank Michael Zuker, for a discussion on melting temperature and Ivo Hofacker, for informing us of the status of the program of Cupal et al. (1996), and for explaining the undocumented option -D of RNAsubopt, which computes the density of states from the enumeration of all secondary structures within a certain energy range of the minimum free energy. As well, thanks to three anonymous reviewers for their suggestions.

Funding: Digiteo Foundation (to P.C. and F.L.), in the form of a Digiteo Chair of Excellence to P.C. National Science Foundation (grants DMS-0817971 and DBI-0543506 to P.C.).

Conflict of Interest: none declared.

Footnotes

1A base triple in S consists of two base pairs (i, j), (i, [ell]) [set membership] S or (i, j), (k, j) [set membership] S. A pseudoknot in S consists of two base pairs (i, j), (k, [ell]) [set membership] S with i < k < j < [ell].

2According to Reinisch and Wolin (2007), one of the functions of Y RNA is to bind to certain misfolded RNAs, including 5S rRNA, as part of a quality control mechanism.

3There is some controversy about the extent to which RNA secondary structure constrains the tertiary structure. See Cho et al. (2009) for more on this point.

4The minimum energy pseudoknotted structure can be computed by maximum weight matching in O(n3) time for the simple Nussinov energy model (Tabaska et al., 1998).

5In the literature, various types of hybridization are allowed. In Dimitrov-Zuker (2004), no intramolecular structure is allowed, while in Bernhardt et al. (2006) pseudoknot-free hybridizations are allowed with intramolecular structure.

REFERENCES

  • Abrahams JP, et al. Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res. 1990;18:3035–3044. [PMC free article] [PubMed]
  • Banerjee AR, et al. Thermal unfolding of a group I ribozyme: The low-temperature transition is primarily disruption of tertiary structure. Biochemistry. 1993;32:153–163. [PubMed]
  • Bekaert M, et al. Towards a computational model for −1 eukaryotic frameshifting sites. Bioinformatics. 2003;19:327–335. [PubMed]
  • Bernhart SH, et al. Partition function and base pairing probabilities of RNA heterodimers. Algorithms. Mol. Biol. 2006;1:3. [PMC free article] [PubMed]
  • Böck A, et al. Selenoprotein synthesis: An expansion of the genetic code. Trends Biochem. Sci. 1991;16:463–467. [PubMed]
  • Bon M. Prédiction de structures secondaires d'ARN avec pseudo-noeuds. PhD thesis, Ecole Polytechnique, 2009. Paris: Ph.D. dissertation in Physics; 2009.
  • Bradley P, et al. Toward high-resolution de novo structure prediction for small proteins. Science. 2005;309:1868–1871. [PubMed]
  • Cheah MT, et al. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature. 2007;447:497–500. [PubMed]
  • Chen Z, Xu Y. Structure prediction of helical transmembrane proteins at two length scales. J. Bioinform. Comput. Biol. 2006;4:317–333. [PubMed]
  • Chitsaz H, et al. A partition function algorithm for interacting nucleic acid strands. Bioinformatics. 2009;25:i365–i373. [PMC free article] [PubMed]
  • Cho SS, et al. Assembly mechanisms of RNA pseudoknots are determined by the stabilities of constituent secondary structures. Proc. Natl Acad. Sci. USA. 2009;106:17349–17354. [PubMed]
  • Chowdhury S, et al. Temperature-controlled structural alterations of an RNA thermometer. J. Biol. Chem. 2003;278:47915–47921. [PubMed]
  • Clote P, Backofen R. Computational Molecular Biology: An Introduction. Chichester, New York, Weinheim, Brisbane, Singapore, Toronto: John Wiley & Sons; 2000. p. 279.
  • Clote P, et al. Asymptotics of canonical and saturated RNA secondary structures. J. Bioinform. Comput. Biol. 2009;7:869–893. [PubMed]
  • Cupal J, et al. Dynamic programming algorithm for the density of states of RNA secondary structures. In: Hofstädt R, Lengauer T, Löffler M, Schomburg D, editors. Computer Science and Biology 96 (Prooceedings of the German Conference on Bioinformatics) Univ. Leipzig.; 1996. pp. 184–186.
  • Danilova LV, et al. RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J. Bioinform. Comput. Biol. 2006;4:589–596. [PubMed]
  • Das R, Baker D. Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl Acad. Sci. USA. 2007;104:14664–14669. [PubMed]
  • Dimitrov RA, Zuker M. Prediction of hybridization and melting for double-stranded nucleic acids. Biophys. J. 2004;87:215–226. [PubMed]
  • Dirks RM, et al. Thermodynamic analysis of interacting nucleic acid strands. SIAM Rev. 2007;49:65–88.
  • Dirks RM, Pierce NA. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem. 2003;24:1664–1677. [PubMed]
  • Eddy SR. How do RNA folding algorithms work? Nat. Biotechnol. 2004;22:1457–1458. [PubMed]
  • Flamm C, et al. RNA folding at elementary step resolution. RNA. 2000;6:325–338. [PubMed]
  • Flamm C, et al. Design of multistable RNA molecules. RNA. 2001;7:254–265. [PubMed]
  • Griffiths-Jones S, et al. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–441. [PMC free article] [PubMed]
  • Hofacker IL, et al. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 1994;125:167–188.
  • Hofacker IL. Vienna RNA secondary structure server. Nucleic Acids Res. 2003;31:3429–3431. [PMC free article] [PubMed]
  • Isambert H, Siggia ED. Modeling RNA folding paths with pseudoknots: application to hepatitis delta virus ribozyme. Proc. Natl Acad. Sci. USA. 2000;97:6515–6520. [PubMed]
  • Kirkpatrick S, et al. Optimization by simulated annealing. Science. 1983;220:671–680. [PubMed]
  • Knudsen B, Hein J, et al. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 2003;31:3423–3428. [PMC free article] [PubMed]
  • Kou SC, et al. Equi-energy sampler with applications in statistical inference and statistical mechanics. Ann. Stat. 2006a;34:1581–1652.
  • Kou SC, et al. A study of density of states and ground states in hydrophobic-hydrophilic protein folding models by equi-energy sampling. J. Chem. Phys. 2006b;124:244903. [PubMed]
  • Lefebvre F. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. 1995. An optimized parsing algorithm well-suited to rna folding. In AAAI press, editor; pp. 222–230. [PubMed]
  • Lim LP, et al. Vertebrate microRNA genes. Science. 2003;299:1540. [PubMed]
  • Lyngso RB, Pedersen CN. RNA pseudoknot prediction in energy-based models. J. Comput. Biol. 2000;7:409–427. [PubMed]
  • Mandal M, et al. Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell. 2003;113:577–586. [PubMed]
  • Markham NR, Zuker M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. 2008;453:3–31. [PubMed]
  • Markham NR. Algorithms and software for nucleic acid sequences. Troy, New York: 2006.
  • Mathews DH, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA. 2004;101:7287–7292. [PubMed]
  • Metropolis N, et al. Equation of state calculations by fast computing machines. J. Chem. Phys. 1953;21:1087–1092.
  • Metzler D, Nebel ME. Predicting RNA secondary structures with pseudoknots by MCMC sampling. J. Math. Biol. 2008;56:161–181. [PubMed]
  • Mückstein U, et al. Thermodynamics of RNA-RNA binding. Bioinformatics. 2006;22:1177–1182. [PubMed]
  • Nawrocki EP, et al. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25:1335–1337. [PMC free article] [PubMed]
  • Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc. Natl Acad. Sci. USA. 1980;77:6309–6313. [PubMed]
  • Omer AD, et al. Homologues of small nucleolar RNAs in Archaea. Science. 2000;288:517–522. [PubMed]
  • Ortiz AR, et al. Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol. 1998;277:419–448. [PubMed]
  • Reeder J, Giegerich R. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC. Bioinformatics. 2004;5:104. [PMC free article] [PubMed]
  • Reinisch KM, Wolin SL. Emerging themes in non-coding RNA quality control. Curr. Opin. Struct. Biol. 2007;17:209–214. [PubMed]
  • Ren J, et al. Hotknots: heuristic prediction of RNA secondary structures including pseudoknots. RNA. 2005;11:1494–1504. [PubMed]
  • Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 1999;285:2053–2068. [PubMed]
  • Tabaska JE, et al. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics. 1998;14:691–699. [PubMed]
  • Tucker BJ, Breaker RR. Riboswitches as versatile gene control elements. Curr. Opin. Struct. Biol. 2005;15:342–348. [PubMed]
  • Van Batenburg FH, et al. Pseudobase: structural information on RNA pseudoknots. Nucleic Acids Res. 2001;29:194–195. [PMC free article] [PubMed]
  • Wang F, Landau DP. Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram. Phys. Rev. E. 2001a;64:056101(1)–056101(16). [PubMed]
  • Wang F, Landau DP. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 2001b;86:2050–2053. [PubMed]
  • Weinger JS, et al. Substrate-assisted catalysis of peptide bond formation by the ribosome. Nat. Struct. Mol. Biol. 2004;11:1101–1106. [PubMed]
  • Wiese KC, et al. JViz.Rna–a Java tool for RNA secondary structure visualization. IEEE. Trans. Nanobiosci. 2005;4:212–218. [PubMed]
  • Wuchty S, et al. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49:145–164. [PubMed]
  • Xayaphoummine A, et al. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res. 2005;33:W605–W610. [PMC free article] [PubMed]
  • Xia T., Jr., et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1999;37:14719–14735. [PubMed]
  • Zhang J, et al. Biopolymer structure simulation and optimization via fragment regrowth Monte Carlo. J. Chem. Phys. 2007;126:225101. [PubMed]
  • Zhao J, et al. Rapid ab initio prediction of RNA pseudoknots via graph tree decomposition. J. Math. Biol. 2008;56:145–159. [PubMed]
  • Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9:133–148. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press