Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2881402

Formats

Article sections

Authors

Related links

Bioinformatics. 2010 June 15; 26(12): i278–i286.

Published online 2010 June 1. doi: 10.1093/bioinformatics/btq218

PMCID: PMC2881402

* To whom correspondence should be addressed.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

**Motivation:** Thermodynamics-based dynamic programming RNA secondary structure algorithms have been of immense importance in molecular biology, where applications range from the detection of novel selenoproteins using expressed sequence tag (EST) data, to the determination of microRNA genes and their targets. Dynamic programming algorithms have been developed to compute the minimum free energy secondary structure and partition function of a given RNA sequence, the minimum free-energy and partition function for the hybridization of two RNA molecules, etc. However, the applicability of dynamic programming methods depends on disallowing certain types of interactions (pseudoknots, zig-zags, etc.), as their inclusion renders structure prediction an nondeterministic polynomial time (NP)-complete problem. Nevertheless, such interactions have been observed in X-ray structures.

**Results:** A non-Boltzmannian Monte Carlo algorithm was designed by Wang and Landau to estimate the density of states for complex systems, such as the Ising model, that exhibit a phase transition. In this article, we apply the Wang-Landau (WL) method to compute the density of states for secondary structures of a given RNA sequence, and for hybridizations of two RNA sequences. Our method is shown to be much faster than existent software, such as `RNAsubopt`. From density of states, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. The advantage of the WL method is that by adding a function to evaluate the free energy of arbitary pseudoknotted structures and of arbitrary hybridizations, we can estimate thermodynamic parameters for situations known to be NP-complete. This extension to pseudoknots will be made in the sequel to this article; in contrast, the current article describes the WL algorithm applied to pseudoknot-free secondary structures and hybridizations.

**Availability:** The *WL RNA hybridization* web server is under construction at `http://bioinformatics.bc.edu/clotelab/`.

**Contact:** ude.cb@etolc

RNA is an important biomolecule, now known to play both an *information carrying* role, as well as a *catalytic* role. Indeed, the genomic information of retroviruses, such as the hepatitis C and human immunodeficiency viruses, is encoded by RNA rather than DNA, while the peptidyl transferase reaction, arguably the most important enzymatic reaction responsible for life, is catalyzed not by a protein, but rather by RNA (Weinger *et al.*, 2004). It has recently emerged that RNA plays a wide range of previously unsuspected roles in many biological processes, including *retranslation* of the genetic code [selenocysteine insertion (Böck *et al.*, 1991), ribosomal frameshift (Bekaert *et al.*, 2003)], transcriptional and translational gene regulation (Lim *et al.*, 2003; Mandal *et al.*, 2003), temperature-sensitive conformational switches (Chowdhury *et al.*, 2003; Tucker and Breaker, 2005), chemical modification of specific nucleotides in the ribosome (Omer *et al.*, 2000), regulation of alternative splicing (Cheah *et al.*, 2007), etc.

A secondary structure for a given RNA nucleotide sequence *a*_{1},…, *a*_{n} is a set *S* of base pairs (*i*, *j*), such that *a*_{i}, *a*_{j} forms either a Watson–Crick or GU (wobble) base pair, and such that there are no *base triples* or *pseudoknots* in *S*.^{1} For example, the secondary structure of Y RNA^{2} with EMBL access code AAPY01489510/220-119 is displayed in Figure 1a and b, while Figure 1c and d depicts the pseudoknotted structure of the Gag/pro ribosomal frameshift site of mouse mammary tumor virus (Van Batenburg *et al.*, 2001). In conventional dot-bracket notation, this latter structure is given as follows, where it should be noted that two kinds of bracket are needed due to the pseudoknot

(**a** and **b**) Pseudoknot-free secondary structure of Y RNA with EMBL access code AAPY01489510/220-119, depicted in (**a**) in Feynman circular form, and in panel (**b**) in classical form. (**c** and **d**) Pseudoknotted structure for the Gag/pro ribosomal frameshift site **...**

It is computationally intractable to compute the minimum free-energy tertiary structure of RNA; indeed, determining the optimal pseudoknotted structure is nondeterministic polynomial time (NP)-complete Lyngso and Pedersen (2000). In contrast, by disallowing pseudoknots, secondary structure prediction is algorithmically tractable; there are dynamic programming algorithms to compute the minimum free-energy structure for a single RNA molecule, as well as for the hybridization of two or more RNA molecules. In particular, such methods can be loosely grouped into two types of algorithm—those that use (i) *a stochastic context free grammar* to compute a covariation model and (ii) free-energy parameters obtained from UV absorbance (optical melting) experiments, in order to determine the minimum free energy structure (i.e. thermodynamic-based algorithms). Examples of stochastic context-free grammars are the programs *Infernal* (Nawrocki *et al.*, 2009) and *Pfold* (Knudsen and Hein *et al.*, 2003). Examples of thermodynamics-based algorithms are the programs `mfold` (Zuker and Stiegler, 1981), `UNAFOLD` (Markham and Zuker, 2008), `RNAfold` (Hofacker *et al.*, 1994), `RNAstructure` (Mathews *et al.*, 2004). Thermodynamics-based algorithms for hybridization of two structures are given in `UNAFOLD` (Dimitrov and Zuker, 2004), `RNAcofold` (Bernhart *et al.*, 2006; Mückstein *et al.*, 2006), while the `NUPACK` software considers hybridization of three or more RNA molecules. (Dirks *et al.*, 2007). Such thermodynamics-based algorithms are particularly important, since the tertiary structure of RNA is believed to be largely determined by secondary structure, which acts as a scaffold for tertiary contacts; see Banerjee *et al.* (1993) for experimental data supporting this view.^{3} Computing the minimum free-energy pseudoknotted structure for a given RNA sequence is NP-complete Lyngso and Pedersen (2000) for the Turner nearest neighbor energy model.^{4} For that reason, pseudoknot structure prediction algorithms fall into three categories: (i) exponential time *exact* algorithms, (ii) dynamic programming algorithms that restrict pseudoknots to a particular class and (iii) heuristic methods. Examples of exact algorithms for pseudoknot structure prediction are the branch-and-bound algorithm of (Bon, 2009) and the method using tree-width decomposition of Zhao *et al.* (2008). Examples of algorithms that consider only pseudoknots of a particular class are found in the pioneering work of Rivas and Eddy (1999) and Lefebvre (1995), with subsequent refinements in Dirks and Pierce (2003); Reeder and Giegerich (2004) and Ren *et al.* (2005) Examples of heuristic approaches include Monte Carlo methods Metzler and Nebel (2008), genetic algorithms Abrahams *et al.* (1990) and a simple, yet elegant algorithm called `ProbKnot` (D.H. Mathews, to appear) that appears to be the state-of-the art method according to recent benchmarking studies. Finally, it is beyond the scope of this article to provide additional background on algorithms for RNA structural alignment, motif detection or tertiary structure prediction.

As will be shown later, by Wang-Landau (WL) Monte Carlo methods, we can obtain essentially the same results as by dynamic programming computation of the partition function from `UNAFOLD` and `RNAcofold`; however, the advantage of the WL approach is that by extending the energy evaluation function for a given structure or hybridization, we can estimate the partition function for arbitrary pseudoknotted structures, known to be an NP-complete problem.

Before proceeding, we formally define a secondary structure as follows. Given an RNA sequence **s** = *a*_{1},…, *a*_{n}, a secondary structure *S* on **s** is defined to be a set of ordered pairs corresponding to base pair positions, which satisfies the following requirements.

*Watson*–*Crick or GU wobble pairs:*if (*i*,*j*) belongs to*S*, then pair (*a*_{i},*a*_{j}) must be one of the following canonical base pairs: (*A*,*U*), (*U*,*A*), (*G*,*C*), (*C*,*G*), (*G*,*U*) and (*U*,*G*).*Threshold requirement:*if (*i*,*j*) belongs to*S*, then*j*−*i*> θ.*Non-existence of pseudoknots:*if (*i*,*j*) and (*k*, ) belong to*S*, then it is not the case that*i*<*k*<*j*< .*No base triples:*if (*i*,*j*) and (*i*,*k*) belong to*S*, then*j*=*k*; if (*i*,*j*) and (*k*,*j*) belong to*S*, then*i*=*k*.

For steric reasons, following convention, the threshold θ, or minimum number of unpaired bases in a hairpin loop, is taken to be three. For any additional background on RNA and dynamic programming computation of secondary structures, see Clote and Backofen (2000) and the recent review Eddy (2004).

The non-Boltzmannian WL Monte Carlo algorithm was developed by Wang and Landau, 2001a, b) to estimate the density of states and partition function for complex systems, such as the Ising model, that exhibit a phase transition. While the Metropolis-Hastings Monte Carlo algorithm samples low energy states, the WL algorithm is designed to visit states uniformly across all energies in a discrete energy landscape. Indeed, for the Metropolis–Hastings algorithm, the expected frequency, or *stationary probability*, *p*_{mc}^{*}(*x*) of visiting the state *x*, whose energy is *E*, is given by the uniform probability times the Boltzmann probability , where *g*(*E*) is the number of states having energy *E* and the partition function *Z* = ∑_{z}*e*^{−E(z)/RT}; in contrast, for the WL algorithm, the expected frequency or stationary probability, of visiting state *x* is , where is the total number of distinct energies *E* (in the discrete case), or of energy bins (in the continuous case). It follows that non-Boltzmannian sampling strategies, such as that devised by Wang and Landau (2001a, b), Kou and Wong Kou *et al.* (2006a), etc. are potentially useful in biopolymer folding, where one searches for a global energy minimum in a landscape having many local energy minima. Indeed in Chen and Xu (2006), Chen and Xu applied the WL algorithm for the structure prediction of helical transmembrane proteins, while the equi-energy sampling method of Kou and Wong Kou *et al.* (2006a), related to Monte Carlo with replica exchange, has been applied to estimate the density of states for lattice protein folding under the hydrophobic–hydrophilic (HP) energy model Kou *et al.* (2006b), as well as in protein structure prediction by the fragment assembly Zhang *et al.* (2009).

In this article, we apply the WL algorithm to compute the density of states and partition function for RNA secondary structure as well as for the hybridization of two RNA sequences. We begin by validating and benchmarking the WL method against the exhaustive method `RNAsubopt` Wuchty *et al.* (1999), that enumerates all secondary structures of a given RNA sequence. Next, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. We describe as well how to compute the partition function *Z*(*T*) over all temperatures from 0°C to 100°C by performing two WL computations, followed by convolution calculations. Although the computation of the partition function over all secondary structures and over all pseudoknot-free hybridizations can be done using the existent software `RNAfold` (Hofacker, 2003), respectively, `RNAcofold` (Bernhart *et al.*, 2006), `UNAFold` (Markham and Zuker, 2008) and a recently published method of Chitsaz et al., the real advantage of our method is that by adding a function to evaluate arbitary pseudoknotted structures and arbitrary hybridizations, we can approximately compute the partition function, heat capacity, melting temperature, etc. for a context known to be NP-complete Lyngso and Pedersen (2000).

The *density of states* is defined to be the *absolute frequency* function for energy; i.e. density of states *g*(*e*) counts the number of states having energy *e*. In the context of RNA secondary structure, a *state* is a secondary structure for an arbitrary but fixed RNA sequence **s**. In Cupal *et al.* (1996), described the first efficient algorithm, running in *O*(*m*^{2}*n*^{3}) time, to compute the density of states for an RNA sequence of length *n*, where energy is discretized into *m* bins. The program of Cupal *et al.* (1996) is no longer available, since it has been superceded by the program `RNAsubopt`, developed by Wuchty *et al.* (1999), which enumerates all secondary structures, whose free energy is within a user-defined bound above the minimum free energy. Though not documented, the `RNAsubopt` program additionally admits the option `-D`, which, instead of outputting structures, outputs only the number of secondary structures in each energy bin above the minimum free energy (bin size 0.1 kcal/mol).

Monte Carlo algorithms have been implemented by a number of groups, to study RNA kinetics of folding. In particular, `KinFold`, developed by Flamm *et al.* (2000), computes the *mean first passage time* (MFPT) of folding, by using a variant of the Gillespie algorithm in an event-driven simulation with a choice of Metropolis–Hastings and Kawasaki dynamics. In Isambert and Siggia (2000) and Xayaphoummine *et al.* (2005) a similar time-driven Monte Carlo simulation program, `KineFold`, is described to compute kinetically determine pseudoknotted structure for a given RNA sequence. Danilova *et al.* (2006) describe the `RNAkinetics` web server used to study the kinetics of the folding transitions of a growing RNA molecule, as in the case of transcriptional folding.

We now begin by providing background definitions and describing the WL algorithm.

The WL algorithm, (Wang and Landau, 2001a, b) was designed in order to compute the *density of states* and *partition function*, neither of which can be computed directly by classical Monte Carlo methods, such as the Metropolis–Hastings algorithm, simulated annealing, replica exchange, etc.

Recall the definition of Markov chain. Let *Q* = {1,…, *n*} be a finite set of states, let π = (*p*_{1},…, *p*_{n}) be the distribution for initial state, and let *P* = (*p*_{i,j}) be a matrix of transition probabilities, satisfying ∑_{j}*p*_{i,j} = 1 for all *i*. A (first-order, time-homogeneous) *Markov chain M* = (*Q*, π, *P*) is a stochastic process, whose state *q*_{t} at time *t* is a random variable determined by

Define *p*_{i}(*t*) = *Pr*[*q*_{t} = *i*] and *p*_{i,j}^{(t)} = *Pr*[*q*_{t} = *j* | *q*_{0} = *i*]. Clearly, the (*i*, *j*)-th entry of the *t*-th power *P*^{t} of *P* equals *p*_{i,j}^{(t)}; moreover, by time-homogeneity it follows that *p*_{i,j}^{(t)} = *Pr*[*q*_{t0+t} = *j*|*q*_{t0} = *i*], for all *t*_{0}. The *stationary probability* of state *i* is defined by lim_{t}*p*_{i}(*t*) = *p*^{*}_{i}, provided the limit exists. It is a classical result that every finite, aperiodic, irreducible Markov chain has an *equilibrium* distribution of *stationary* probabilities; see the text of Clote and Backofen (2000) for a new, self-contained proof of this result. A Markov chain with state set *Q* and stationary probabilities *p*^{*}_{1},…, *p*^{*}_{n} is *reversible*, if for all *i*, *j* *Q*, *p*^{*}_{i}*p*_{i,j} = *p*_{*}_{j}*p*_{j,i}.

Figure 2 presents pseudocode for the classical Metropolis–Hastings Monte Carlo algorithm with simulated annealing (Kirkpatrick *et al.*, 1983; Metropolis *et al.*, 1953), which implements a random walk on the Markov chain whose transition probabilities *p*_{i,j} of moving from state *x*_{i} to *x*_{j} is given by

(1)

where (*x*_{i}) is the set of immediate neighbors of state *x*_{i} and (*x*_{j}) the set of immediate neighbors of state *x*_{j}; i.e. (*x*_{i}) is the set of states that can be reached by a single move from state *x*_{i}. It can be proved that the stationary probabilities for this Markov chain are given by the Boltzmann probabilities , as shown in Clote and Backofen (2000).

In contrast, Figure 3 presents pseudocode for the WL algorithm, which implements a random walk on the Markov chain whose transition probabilities *p*_{i,j} of moving from state *x*_{i} to *x*_{j} are given by

(2)

In this case, the stationary probability of state *x*_{i} is are given by .

Pseudocode for WL algorithm, as applied to RNA secondary structure density of states computation. In line 8, (*S*) denotes the collection of immediate neighbors of structure *S*; i.e. those obtained by adding or removing a single base pair. In line **...**

The mathematical–justification for applying the Metropolis-Hastings Monte Carlo method (Metropolis *et al.*, 1953) to determine the minimum energy conformation of a biopolymer (Bradley *et al.*, 2005; Das and Baker, 2007; Ortiz *et al.*, 1998) depends on two facts: (i) every finite, irreducible, aperiodic Markov chain has a *stationary probability* distribution and (ii) if the Markov chain is *reversible*, a situation called *detailed balance* by the physics community, then the stationary distribution of the Markov chain corresponding to the Metropolis–Hastings algorithm is the *Boltzmann distribution*, defined by , where *E*(*x*) is the energy of state (i.e. conformation) *x*, *R* is the universal gas constant 1.986 cal/mol, *T* is absolute temperature, and the *partition function Z* is defined by ∑_{x} exp(−*E*(*x*)/*RT*, where the sum is taken over all states *x* in the Markov chain. As temperature *T* approaches zero, the Boltzmann probability of the minimum energy state approaches 1, in the case of a unique minimum energy state, or more generally 1/*m*, in the case of *m* distinct minimum energy states. See Clote and Backofen (2000) for details.

In contrast to the Metropolis–Hastings algorithm, which performs a random walk on the Markov chain of states (secondary structures), the WL algorithm performs a random walk on the *energy space* of the Markov chain of states (secondary structures), where the stationary probability of visiting energy *e*_{i} is proportional to , then the histogram of energies encountered in the random walk will be flat.

In this article, we consider the Markov chain, whose states are the secondary structures of a given RNA sequence, and for which permissible local moves correspond to the addition or removal of a single base pair (Flamm *et al.*, 2000). Although detailed balance holds for the Metropolis–Hastings algorithm in Figure 2, it does not necessarily hold for the Metropolis algorithm, obtained by replacing line

by

Indeed for the case of RNA secondary structures, detailed balance does *not* hold in this situation, since if we define the stationary probability *p*^{*}_{i} for state *x*_{i} to be the Boltzmann probability , and the transition probabilities given by Equation (1), then it is not always the case that *p*^{*}_{i} · *p*_{i,j} = *p*^{*}_{j} · *p*_{j,i}. For instance, the empty structure *S* =………. on the 10-mer `GGGGGCCCCC` has 18 immediate neighbors, one of which is *T* = (……). The structure *T* has 11 immediate neighbors, one of which is the empty structure *S*. Letting *x*_{i} = *S* and *x*_{j} = *T*, we have *E*(*x*_{i}) = 0 kcal/mol, *E*(*x*_{j}) = 2.70 kcal/mol, ensemble free energy is −*RTln*(*Z*) = −3.96, hence *Z* = exp(3.96/*RT*) where *T* = 310°C so *Z* = 621.5 and we have stationary probabilities , , and . We compute that

Summarizing, in the Metropolis algorithm (with modified line 11), reversibility of a Markov chain depends on the permissible local moves, while in the Metropolis–Hastings algorithm (with line 11 as in Fig. 2), reversibility is always ensured. In the case at hand, if every secondary structure is an immediate neighbor of every secondary structure, then in the Metropolis algorithm, transition probabilities would be

(3)

where is the number of secondary structures. In this case, an easy computation shows that the Markov chain is reversible. Despite the non-reversible nature of the Markov chain corresponding to the Metropolis algorithm, whose states are the secondary structures of a given RNA sequence, and whose local moves consist of the addition or removal of a single base pair, it has been a standard practice to apply the Metropolis algorithm in this case (Danilova *et al.*, 2006; Flamm *et al.*, 2000; Isambert and Siggia, 2000; Xayaphoummine *et al.*, 2005). For that reason, we do not hesitate to apply the WL algorithm for the study of RNA secondary structure formation.

Note that in Figure 3, the WL computes the relative density of states, defined by *g*(*i*) = *N*(*e*_{i})/*N*, where *N*(*e*_{i}) is the number of states having energy *e*_{i} and *N* is the total number of states. In the case of RNA secondary structures, it is simple to compute the total number of secondary structures by dynamic programming, given as follows. Given an RNA sequence of length *n*, let *BP*_{i,j} = 1 if positions *i*,*j* can form a Watson–Crick or wobble pair, otherwise let *BP*_{i,j} = 0. Let θ = 3 denote the minimum number of unpaired bases in a hairpin loop. Letting *N*_{i,j} denote the number of secondary structures on subsequence [*i*, *j*] of the given RNA sequence, we have that *N*_{i,j} = 0 if *j* < *i* + 3, and otherwise

It follows that the total number of secondary structures is then *N*_{1,n}. From the relative density of states computed by WL algorithm, we compute the absolute density of states by

For fixed temperature *T* for which the WL computation was done, we can compute the partition function *Z*(*T*) = ∑_{S} exp(−*E*(*S*)/*RT*) by

(4)

In their original article Wang and Landau (2001a, b) mentioned that in the case of the Ising model, Equation (4) allows one to compute the partition function at any desired temperature *T* from the density of states. Unfortunately, this is no longer the case for the Turner nearest neighbor model Xia *et al.*, 1999 of RNA secondary structure, since the free energy parameters for stacked base pairs, hairpins, bulges, internal loops, etc. all depend on temperature. We can nevertheless proceed by computing the density of states for free energy at *T* = 37°C, and the density of states for enthalpy (assumed to be temperature independent), and then by convoluting these values, we obtain the density of states for free energy at any desired temperature.

Figure 4a displays the relative density of states for the free energy of secondary structures of the 45 nt flavivirus capsid hairpin (cHP) with EMBL access code AB010982/1-45. Figure 4b displays the sum of squared differences between the density of states and the best fitting normal distribution, respectively, extreme value distribution. The cHP is a conserved RNA hairpin structure in the capsid-coding region of flavivirus genomes. Note that the relative density of states, or energy histogram, is approximately normal. In Clote *et al.* (2009) it is rigorously proved that the relative density of states is asymptotically normal; specifically, it is shown that the limit, as *n* approaches infinity, of the relative density of states for an RNA sequence of length *n* is normal, where for the purpose of mathematical analysis it is assumed that any base can pair with any other base (homopolymer model) and that the energy of a secondary structure *S* is −1 times the number of base pairs in *S* (Nussinov energy model; Nussinov and Jacobson, 1980).

Following the approach in program `RNAcofold` of Bernhart *et al.* (2006), we can modify the WL program of Figure 3 to compute the density of states for all *hybridizations* of two RNA sequences, where both intermolecular and intramolecular base-pairing is allowed, provided that there are no pseudoknots.

In the case of the hybridization of twoRNA secondary structures, the first of length *n* and the second of length *m*, we can compute the total number of hybridizations as follows. Given an RNA sequence *A* = *a*_{1},…, *a*_{n} of length *n* and an RNA sequence *B* = *b*_{1},…, *b*_{m} of length *m*, let *HP*_{i,j} = 1 if positions *a*_{i}, *b*_{j} can hybridize, forming a Watson–Crick or wobble pair, otherwise let *HP*_{i,j} = 0. For 1 ≤ *i*, *j* ≤ *n*, 1 ≤ *k*, ≤ *m*, let *H*_{i,j;k,} denote the number of hybridizations of the subsequence *a*_{i},…, *a*_{j} with *b*_{k},…, *b*_{}. From Equation (3), we can compute the number *NA*_{x,y}, respectively, *NB*_{x,y} of secondary structures on subsequence *a*_{x},…,*a*_{y} of *A*, respectively, *b*_{x},…, *b*_{y} of *B*. If *j* < *i* or < *k*, then defined *H*_{i,j;k,} = 0; otherwise define *H*_{i,j;k,} by

(5)

It follows that the total number of pseudoknot-free hybridizations is then *H*_{1,n;1,m}.^{5} The previous algorithm is clearly *O*(*n*^{4}).

By considering the number of hybridizations to be the same as the number of secondary structures of a chimeric sequence, formed by concatenating *A*, *B* to form *c*_{1},…, *c*_{n+m} = *a*_{1},…, *a*_{n}, *b*_{1},…, *b*_{m}, we have an *O*(*n*^{3}) algorithm, as follows. For 1 ≤ *i*, *j* ≤ *n* + *m*, if *j* < *i* or (1 ≤ *i*, *j* ≤ *n*, *j* − *i* ≤ θ = 3), then *N*_{i,j} = 0, while if 1 ≤ *i* ≤ *n*, *n* + 1 ≤ *j* ≤ *n* + *m*, then *N*_{i,j} = 1; otherwise *N*_{i,j} is equal to

It follows that the total number of hybridizations is then *N*_{1,n}.

We now describe how to compute the *melting temperature T*_{M} of hybridization.

- Compute number of structures for each of five species (temperature independent): (
*A*), (*B*), (*AA*), (*BB*) and (*AB*). - For temperature
*T*{0°C,…, 100°C}, compute*relative density of states f*(*A*,*T*),*f*(*B*,*T*),*f*(*AA*,*T*),*f*(*BB*,*T*) and*f*(*AB*,*T*) for each species by WL. - For temperature
*T*{0°C,…, 100°C}, compute partition functions*Z*(*A*,*T*),*Z*(*B*,*T*),*Z*(*AA*,*T*),*Z*(*BB*,*T*) and*Z*(*AB*,*T*) bywhere*absolute density of states g*(*E*) is relative density times number of structures. For instance - Following Dimitrov and Zuker (2004), for temperature
*T*{0°C,…, 100°C}, compute ensemble free energy Δ*G*(*A*,*T*), Δ*G*(*B*,*T*), Δ*G*(*AA*,*T*), Δ*G*(*BB*,*T*) and Δ*G*(*AB*,*T*). This involves the following.- Redundancy correction:
- Symmetry correction:
- Temperature-dependent chemical equilibrium constants:
- Temperature-dependent concentration (number) of molecules A and B:where
*N*_{A}^{0},*N*_{B}^{0}are given and*K*_{A},*K*_{B},*K*_{AB}are obtained from the previous step. Values*N*_{A}and*N*_{B}are gotten by using, for example, Newton's method for solving two nonlinear functions; due to issues of numerical instability, Markham uses binary search (p. 43 of Markham, 2006). - Letting
*Z*(*A*,*B*,*AB*,*AA*,*BB*) equal the following expression:it follows that the total partition function*Z*satisfieswhich can be approximated by the term*Z*(*A*,*B*,*AB*,*AA*,*BB*) where*N*_{A},*N*_{B},*N*_{AB},*N*_{AA},*N*_{BB}obtained as previously explained. The chemical potential μ_{X}for each species*X*is the partial derivative of ensemble free energy with respect to number of molecules of*X*, hencesoTotal free energy satisfieswhich simplifies to - Normalize the ensemble free energy in terms of energy per mole of solute:

- Determine
*heat capacity*as a function of temperature byby computing the second partial of a fitting parabola determined by 2m+1 evenly spaced points, using the approximation for given byIn a post-processing step, smooth the heat capacity curve by computing a running average. The melting temperature*T*_{M}(*C*_{p}) is computed by determining the temperature at which heat capacity achieves a maximum.

The Figure 5a displays the run time of the WL method, compared with that of `RNAsubopt` from the Vienna RNA package, while the Figure 5b of the same figure shows sample output from our WL program. Figure 5 clearly shows the advantage of WL over existent methods in computing the density of states for both single RNA molecules and for hybridization complexes of two RNA molecules. Figure 6a and b depicts the heat capacity computed by the WL method (Fig. 5a) and the program `UNAFold` (Fig. 5b). Melting temperature, which is usually defined as the temperature at which half of the molecules are single-stranded, while the other half are hybridized, is determined as that temperature where heat capacity achieves its maximum. The program `UNAFold` does not allow any intramolecular structure (base pairing between 2 nt of the same structure), a feature that our WL method permits, as does the `RNAcofold` program. While it is clear that additional work must be done to improve heat capacity computation with the WL method, the melting temperature *T*_{M} computed by WL agrees reasonably well with that computed by *O*(*n*^{3}) methods `UNAFold`, `RNAcofold`, and the recent *O*(*n*^{6}) method of Chitsaz *et al.* (2009) each of which methods admits slightly different interactions.

Computation of heat capacity *c*_{P}(*T*) for the toy sequence 5′-`AGCGA`-3′, hybridized to its reverse complement 3′-`UCGCU`-5′. (**a**) Graph generated by WL method described in this article. (**b**) Graph generated by the program `UNAFold` **...**

We now describe how to approximately compute the partition function *Z*(*T*) over all secondary structures and over all pseudoknot-free hybridizations, simultaneously over all temperatures from 0°C to 100°C, by performing two WL computations, followed by a computation of the convolution of enthalpy relative frequency with free-energy relative frequency. Similar computations using existent methods require over 100 cubic time computations.

- Compute the relative density of states
*p*_{h}for free energy using WL with temperature*T*= −273°C (absolute zero Kelvin). It follows that*p*_{h}is the relative density of states for enthalpy, Due to the fundamental thermodynamic relationwhere(6)*T*(K) is absolute temperature and Δ*G*, Δ*H*, Δ*S*, respectively, denote the change in free energy, enthalpy and entropy. - Compute the relative density of states
*p*_{g}for free energy using WL with temperature*T*= 37°C (310 K). - From Equation (6), we have that
- Given arbitrary absolute temperature
*T*, compute the relative density of states for free energy at temperature*T*by the following pseudocode, representing a kind of*convolution*of*p*_{g}with*p*_{h}. - Compute the absolute density of states
*g*(*z*) =*p*(*z*) ·*N*, where*N*is the total number of secondary structures, computed by Equation (3).

By this method, one can approximate the partition function *Z*(*T*) for all temperatures from 0°C to 100°C, by performing two WL sampling runs, respectively, at temperatures −373°C and 37°C, and then to repeatedly perform a fast convolution. The method just described, which involves two WL computations, together with convolution computations, has until now not worked well in practice, for certain technical reasons. This direction needs further exploration.

Another issue concerning any sampling method is the required time to obtain reasonably good estimates of the quantity in question. In the case of RNA kinetics, computations of MFPT to reach the minimum free-energy structure take inordinate amounts of time, when using Metropolis–Hastings Monte Carlo methods, which are *time-driven* simulations. For this reason, the program `KinFold` (Flamm *et al.*, 2001) uses an *event-driven* simulation, where time is incremented by an exponentially distributed random variable. It may be possible to use similar ideas to increase efficiency of our WL program, which should further improve the accuracy in the computation of heat capacity. Finally, we intend to implement a new energy evaluation function, that allows arbitrary pseudoknots, zig-zags, etc. using energy parameters from the recent dissertation of Bon (Bon, 2009). This will allow us to estimate the partition function, ensemble free energy, heat capacity, melting temperature, etc. for a context known to be NP-complete.

In this article, we have implemented the WL algorithm to compute the relative density of states for RNA secondary structures and hybridizations. Separately computing the number of structures and hybridizations, we obtain the absolute density of states, which then yields the partition function, and thence, in the case of hybridization, the melting temperature. The WL method is much faster than existent software `RNAsubopt` in computing the density of states, but could not be benchmarked with the binning method of Cupal *et al.* (1996) which runs in *O*(*m*^{2}*n*^{3}) time, for length *n* sequence and *m* energy bins, since the latter software is no longer available, being superceded by `RNAsubopt-D`. In preliminary tests, we obtain roughly the same melting temperature for duplex RNA, as that computed by existent methods; however, the real advantage of the WL method is that there is no restriction on types of allowed interaction, unlike the situation with dynamic programming approaches that disallow pseudoknots, zig-zags, etc.

We would like to thank Michael Zuker, for a discussion on melting temperature and Ivo Hofacker, for informing us of the status of the program of Cupal *et al.* (1996), and for explaining the undocumented option `-D` of `RNAsubopt`, which computes the density of states from the enumeration of all secondary structures within a certain energy range of the minimum free energy. As well, thanks to three anonymous reviewers for their suggestions.

*Funding*: Digiteo Foundation (to P.C. and F.L.), in the form of a Digiteo Chair of Excellence to P.C. National Science Foundation (grants DMS-0817971 and DBI-0543506 to P.C.).

*Conflict of Interest*: none declared.

^{1}A base triple in *S* consists of two base pairs (*i*, *j*), (*i*, ) *S* or (*i*, *j*), (*k*, *j*) *S*. A pseudoknot in *S* consists of two base pairs (*i*, *j*), (*k*, ) *S* with *i* < *k* < *j* < .

^{2}According to Reinisch and Wolin (2007), one of the functions of Y RNA is to bind to certain misfolded RNAs, including 5S rRNA, as part of a quality control mechanism.

^{3}There is some controversy about the extent to which RNA secondary structure constrains the tertiary structure. See Cho *et al.* (2009) for more on this point.

^{4}The minimum energy pseudoknotted structure can be computed by maximum weight matching in *O*(*n*^{3}) time for the simple Nussinov energy model (Tabaska *et al.*, 1998).

^{5}In the literature, various types of hybridization are allowed. In Dimitrov-Zuker (2004), no intramolecular structure is allowed, while in Bernhardt *et al.* (2006) pseudoknot-free hybridizations are allowed with intramolecular structure.

- Abrahams JP, et al. Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucleic Acids Res. 1990;18:3035–3044. [PMC free article] [PubMed]
- Banerjee AR, et al. Thermal unfolding of a group I ribozyme: The low-temperature transition is primarily disruption of tertiary structure. Biochemistry. 1993;32:153–163. [PubMed]
- Bekaert M, et al. Towards a computational model for −1 eukaryotic frameshifting sites. Bioinformatics. 2003;19:327–335. [PubMed]
- Bernhart SH, et al. Partition function and base pairing probabilities of RNA heterodimers. Algorithms. Mol. Biol. 2006;1:3. [PMC free article] [PubMed]
- Böck A, et al. Selenoprotein synthesis: An expansion of the genetic code. Trends Biochem. Sci. 1991;16:463–467. [PubMed]
- Bon M. Prédiction de structures secondaires d'ARN avec pseudo-noeuds. PhD thesis, Ecole Polytechnique, 2009. Paris: Ph.D. dissertation in Physics; 2009.
- Bradley P, et al. Toward high-resolution de novo structure prediction for small proteins. Science. 2005;309:1868–1871. [PubMed]
- Cheah MT, et al. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature. 2007;447:497–500. [PubMed]
- Chen Z, Xu Y. Structure prediction of helical transmembrane proteins at two length scales. J. Bioinform. Comput. Biol. 2006;4:317–333. [PubMed]
- Chitsaz H, et al. A partition function algorithm for interacting nucleic acid strands. Bioinformatics. 2009;25:i365–i373. [PMC free article] [PubMed]
- Cho SS, et al. Assembly mechanisms of RNA pseudoknots are determined by the stabilities of constituent secondary structures. Proc. Natl Acad. Sci. USA. 2009;106:17349–17354. [PubMed]
- Chowdhury S, et al. Temperature-controlled structural alterations of an RNA thermometer. J. Biol. Chem. 2003;278:47915–47921. [PubMed]
- Clote P, Backofen R. Computational Molecular Biology: An Introduction. Chichester, New York, Weinheim, Brisbane, Singapore, Toronto: John Wiley & Sons; 2000. p. 279.
- Clote P, et al. Asymptotics of canonical and saturated RNA secondary structures. J. Bioinform. Comput. Biol. 2009;7:869–893. [PubMed]
- Cupal J, et al. Dynamic programming algorithm for the density of states of RNA secondary structures. In: Hofstädt R, Lengauer T, Löffler M, Schomburg D, editors. Computer Science and Biology 96 (Prooceedings of the German Conference on Bioinformatics) Univ. Leipzig.; 1996. pp. 184–186.
- Danilova LV, et al. RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J. Bioinform. Comput. Biol. 2006;4:589–596. [PubMed]
- Das R, Baker D. Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl Acad. Sci. USA. 2007;104:14664–14669. [PubMed]
- Dimitrov RA, Zuker M. Prediction of hybridization and melting for double-stranded nucleic acids. Biophys. J. 2004;87:215–226. [PubMed]
- Dirks RM, et al. Thermodynamic analysis of interacting nucleic acid strands. SIAM Rev. 2007;49:65–88.
- Dirks RM, Pierce NA. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem. 2003;24:1664–1677. [PubMed]
- Eddy SR. How do RNA folding algorithms work? Nat. Biotechnol. 2004;22:1457–1458. [PubMed]
- Flamm C, et al. RNA folding at elementary step resolution. RNA. 2000;6:325–338. [PubMed]
- Flamm C, et al. Design of multistable RNA molecules. RNA. 2001;7:254–265. [PubMed]
- Griffiths-Jones S, et al. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–441. [PMC free article] [PubMed]
- Hofacker IL, et al. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 1994;125:167–188.
- Hofacker IL. Vienna RNA secondary structure server. Nucleic Acids Res. 2003;31:3429–3431. [PMC free article] [PubMed]
- Isambert H, Siggia ED. Modeling RNA folding paths with pseudoknots: application to hepatitis
*delta*virus ribozyme. Proc. Natl Acad. Sci. USA. 2000;97:6515–6520. [PubMed] - Kirkpatrick S, et al. Optimization by simulated annealing. Science. 1983;220:671–680. [PubMed]
- Knudsen B, Hein J, et al. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 2003;31:3423–3428. [PMC free article] [PubMed]
- Kou SC, et al. Equi-energy sampler with applications in statistical inference and statistical mechanics. Ann. Stat. 2006a;34:1581–1652.
- Kou SC, et al. A study of density of states and ground states in hydrophobic-hydrophilic protein folding models by equi-energy sampling. J. Chem. Phys. 2006b;124:244903. [PubMed]
- Lefebvre F. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. 1995. An optimized parsing algorithm well-suited to rna folding. In AAAI press, editor; pp. 222–230. [PubMed]
- Lim LP, et al. Vertebrate microRNA genes. Science. 2003;299:1540. [PubMed]
- Lyngso RB, Pedersen CN. RNA pseudoknot prediction in energy-based models. J. Comput. Biol. 2000;7:409–427. [PubMed]
- Mandal M, et al. Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell. 2003;113:577–586. [PubMed]
- Markham NR, Zuker M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. 2008;453:3–31. [PubMed]
- Markham NR. Algorithms and software for nucleic acid sequences. Troy, New York: 2006.
- Mathews DH, et al. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA. 2004;101:7287–7292. [PubMed]
- Metropolis N, et al. Equation of state calculations by fast computing machines. J. Chem. Phys. 1953;21:1087–1092.
- Metzler D, Nebel ME. Predicting RNA secondary structures with pseudoknots by MCMC sampling. J. Math. Biol. 2008;56:161–181. [PubMed]
- Mückstein U, et al. Thermodynamics of RNA-RNA binding. Bioinformatics. 2006;22:1177–1182. [PubMed]
- Nawrocki EP, et al. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25:1335–1337. [PMC free article] [PubMed]
- Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc. Natl Acad. Sci. USA. 1980;77:6309–6313. [PubMed]
- Omer AD, et al. Homologues of small nucleolar RNAs in Archaea. Science. 2000;288:517–522. [PubMed]
- Ortiz AR, et al. Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol. 1998;277:419–448. [PubMed]
- Reeder J, Giegerich R. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC. Bioinformatics. 2004;5:104. [PMC free article] [PubMed]
- Reinisch KM, Wolin SL. Emerging themes in non-coding RNA quality control. Curr. Opin. Struct. Biol. 2007;17:209–214. [PubMed]
- Ren J, et al. Hotknots: heuristic prediction of RNA secondary structures including pseudoknots. RNA. 2005;11:1494–1504. [PubMed]
- Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 1999;285:2053–2068. [PubMed]
- Tabaska JE, et al. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics. 1998;14:691–699. [PubMed]
- Tucker BJ, Breaker RR. Riboswitches as versatile gene control elements. Curr. Opin. Struct. Biol. 2005;15:342–348. [PubMed]
- Van Batenburg FH, et al. Pseudobase: structural information on RNA pseudoknots. Nucleic Acids Res. 2001;29:194–195. [PMC free article] [PubMed]
- Wang F, Landau DP. Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram. Phys. Rev. E. 2001a;64:056101(1)–056101(16). [PubMed]
- Wang F, Landau DP. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 2001b;86:2050–2053. [PubMed]
- Weinger JS, et al. Substrate-assisted catalysis of peptide bond formation by the ribosome. Nat. Struct. Mol. Biol. 2004;11:1101–1106. [PubMed]
- Wiese KC, et al. JViz.Rna–a Java tool for RNA secondary structure visualization. IEEE. Trans. Nanobiosci. 2005;4:212–218. [PubMed]
- Wuchty S, et al. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49:145–164. [PubMed]
- Xayaphoummine A, et al. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res. 2005;33:W605–W610. [PMC free article] [PubMed]
- Xia T., Jr., et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1999;37:14719–14735. [PubMed]
- Zhang J, et al. Biopolymer structure simulation and optimization via fragment regrowth Monte Carlo. J. Chem. Phys. 2007;126:225101. [PubMed]
- Zhao J, et al. Rapid ab initio prediction of RNA pseudoknots via graph tree decomposition. J. Math. Biol. 2008;56:145–159. [PubMed]
- Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9:133–148. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |