Nucleic Acids Res. 2011 January; 39(1): 292–299.
Published online 2010 September 1. doi:  10.1093/nar/gkq642
PMCID: PMC3017586

The ends of a large RNA molecule are necessarily close

Abstract

We show on general theoretical grounds that the two ends of single-stranded (ss) RNA molecules (consisting of roughly equal proportions of A, C, G and U) are necessarily close together, largely independent of their length and sequence. This is demonstrated to be a direct consequence of two generic properties of the equilibrium secondary structures, namely that the average proportion of bases in pairs is ~60% and that the average duplex length is ~4. Based on mfold and Vienna computations on large numbers of ssRNAs of various lengths (1000–10 000 nt) and sequences (both random and biological), we find that the 5′–3′ distance—defined as the sum of H-bond and covalent (ss) links separating the ends of the RNA chain—is small, averaging 15–20 for each set of viral sequences tested. For random sequences this distance is ~12, consistent with the theory. We discuss the relevance of these results to evolved sequence complementarity and specific protein binding effects that are known to be important for keeping the two ends of viral and messenger RNAs in close proximity. Finally we speculate on how our conclusions imply indistinguishability in size and shape of equilibrated forms of linear and covalently circularized ssRNA molecules.

INTRODUCTION

There are many situations in which it is biologically important for the two ends of a large RNA molecule to be close to each other. In animal viruses with single-stranded (ss) RNA genomes, for example, efficient replication of the genome has been shown to depend on its effective ‘circularization’. More explicitly, complementary sequences have been identified at or near the 5′- and 3′-ends that are responsible for forming ‘panhandles’ that keep the two ends close together. These panhandles are duplexes that are 21 bp in the case of yellow fever virus (1), and 15 bp in the case of influenza A (2), thereby according them unusual robustness. Another example where RNA genome circularization of this kind has been implicated in RNA replication is sindbis virus; here an 18 bp 5′–3′ panhandle has been shown to survive denaturing conditions sufficient to eliminate much of the remaining secondary structure, leaving the genome with a circular appearance in electron micrographs (3). In dengue, also (like yellow fever, influenza A and sindbis) a positive-sense RNA virus, minus-strand synthesis involves long-distance 5′–3′ base pairing that facilitates the transfer of the RNA-dependent RNA polymerase from its binding site at the 5′-end to the initiation site at the 3′-end (4). Similarly, circularization of HIV-1 has been shown to arise from base pairing between the 5′- and 3′-ends of the RNA genome (5); these interactions are found to occur as well in different HIV-1 subtypes with large sequence variation, suggesting they share an evolutionary basis.

It has also long been known that effective circularization of messenger RNA molecules is important for efficient translation. The 5′- ‘capping’ and 3′-polyadenylation of mRNAs—through a variety of specific protein-binding events—result in the association of the two ends of the molecules and subsequent formation of translation initiation complexes (6). In eukaryotes, for example, the 3′-poly(A) ‘tail’ interacts with the poly(A)-binding protein, the 5′-G-cap binds a eukaryotic initiation factor, and these two bound proteins—with the full length of mRNA intervening—simultaneously bind a ‘bridging’ protein. This effective circularization of the molecule results in recruitment of the 40S ribosomal subunit (via binding of still another protein) and initiation of translation.

Because circularization of mRNA is so important for its translation, mechanisms that co-localize the ends have evolved even in cases where the molecules are not capped or polyadenylated. Plant viruses, for example, often lack both of these special sequences and yet are translated efficiently (7,8). The effective circularization is enhanced by direct base pairing between sub-sequences in the untranslated regions (UTRs) at the 5′- and 3′-ends; the UTRs functionally replace the G-cap and poly(A) tail. Further, the RNAs of many positive-sense (mRNA) viruses have internal ribosome entry sites (IRESs) at their 5′-ends, i.e. subsequences that recruit ribosomes and initiate translation (9,10).

In all of the above examples—involving both direct interaction between 5′- and 3′-ends or interaction mediated by binding proteins—particular, evolved, subsequences are involved in effective circularization. But in all of these scenarios, an even more fundamental requirement is that the two ends of the fluctuating molecule must spend enough time near each other in order for there to be a high probability for the special elements—RNA subsequences or binding proteins—to find one another. More explicitly, we will argue here that effective circularization of large RNA molecules is achieved through generic properties of secondary structure that are essentially independent of sequence. The specific evolved subsequences mentioned above are not needed so much for circularization as for facilitating the binding of particular proteins—e.g. RNA replicases and ribosome initiation factors—that are important for biological function of the circularized RNA.

Consider the analogous situation of double-stranded (ds) DNA with ‘sticky’ ends arising from complementary ss overhangs (generated, say, by a restriction enzyme). Here the probability of the two ends being covalently bound by a ligase is directly determined by, and ultimately limited by, the likelihood that they are close enough to each other to bind, i.e. that the double helix can twist and bend enough for its two ends to get close together (11). This classic problem is informed by the well-known statistical mechanical result giving the likelihood of the ends of a linear, semiflexible polymer being within a monomer distance of one another. For sufficiently long molecules ($L \gg l_p$) this probability is of order $(l_p/L)^{3/2}$, where $L$ and $l_p$ are the contour and persistence lengths, respectively, of the linear polymer; the contour length is the number of monomers times the average inter-monomer distance, and the persistence length is the distance along the chain contour beyond which the polymer can bend almost freely (12). Thus, the circularization probability of long DNA is small because $L/l_p$ is large, i.e. the molecule is long compared to its persistence length (50 nm, for DNA): maximization of configurational entropy requires that the ends be far apart. The small probability of finding them close, decreasing as $L^{-3/2}$, reflects directly the fact that the root-mean-square distance between the ends of the molecule is increasing as $L^{1/2}$.

To understand the basis for effective circularization of ssRNA, then, it is natural to ask: is there, in analogy with dsDNA, a generic result for the probability of finding the two ends of an RNA molecule close to one another, and how different is it from that for a linear polymer? In this article we argue that there is indeed a universal distribution of end-to-end distances in large RNA molecules, and furthermore that it is essentially independent of overall sequence and length. We show in particular that the distance between ends is necessarily small, because of generic features of the secondary structure, notably that the percentage ($f$) of paired nucleotides (nt) is ~60% and that the average duplex length ($k$) is ~4. Using an early variant of the RNA folding algorithm developed by Zuker et al. (13,14), Fontana et al. (15) have calculated various characteristics of the minimum free energy (MFE) structure corresponding to several different types of short (20–100 nt) sequences. Averaging over many sequences of the same length (number of nucleotides, $N$) and base composition, they found that $f$ and $k$ approach constant values with increasing $N$. They also calculated a property (the number of unpaired bases in ‘joints’ and ‘free ends’) that is closely related to our definition of the 5′–3′ distance (see next section), finding that for the short chains analyzed this number increases, yet with a gradually decreasing slope, as $N$ increases. The constancy of $f$ and $k$ has been confirmed for a wide range of biological (viral and yeast) ssRNA sequences (16) by application of the mfold and Vienna codes for predicting thermally accessible secondary structures.

For certain models of polynucleotide chains, the $N$-independence of $f$ and $k$ has been proven analytically, using a variety of powerful theoretical tools. Hofacker et al. (17), applying an elegant graph-theoretic approach, derived exact results for these properties (see their Table 3) and various other secondary structure attributes of RNA-like heteropolymers. Their results apply to an idealized ensemble where all possible secondary structures have equal statistical weight, resulting in low values of $f$ and $k$. More recently, Clote et al. (18), using the Nussinov–Jacobson (‘maximum base pairing’) model (19), have shown that, for an ssRNA chain with Watson–Crick pairing rules, $f$ approaches a constant value slightly exceeding 90% for $N$ large (>1000). Earlier, de Gennes had noted (20) that, for a random sequence of two complementary nucleotides, the distance between chain ends remains finite even as $N$ approaches infinity. Based on this notion he also concluded that ‘many properties of a large, open, strand are not very different from those of a cyclic strand of equal molecular length’ (20). We elaborate on this idea in the next section.

Our goal in the present work is to emphasize the generality of the proximity of the 5′- and 3′-ends of large RNA molecules of arbitrary length and sequence. Based on the general findings noted above for large ssRNA chains, we derive a simple expression for the 5′–3′ distance that can be evaluated numerically for sequences of given length $N$ and base composition. We also calculate this distance using the RNAsubopt (21,22) and mfold (23,24) folding algorithms. A further consequence of our analyses is that the secondary (and hence tertiary) structures of linear and covalently-circularized RNA molecules are practically identical. These conclusions are tested against several systematic calculations of secondary structures for specific linear and circular sequences, both random and viral.

METHODS

Figure 1A displays the MFE secondary structure of a rather short (200 nt) random-sequence ssRNA molecule, composed of equal numbers of A, C, G and U, as predicted by the mfold algorithm (23,24). The duplexes are represented in the usual way by straight ‘ladders’ and the loops by circles of different sizes. The same secondary structure is visualized slightly less schematically in Figure 1B, with more realistic scaling of duplex dimensions, using the jViz.Rna drawing program (25). This latter representation illustrates that the dangling ss segments in the ‘exterior loop’—the one including the 5′- and 3′-ends—are independent flexible chains. In Figure 1C the secondary structure is mapped into a tree graph, where each edge (bond) represents a duplex and the vertices represent the loops (15,17,26); the interior loops are denoted by solid circles, and the exterior loop by an open circle. The term ‘interior loop’ is conventionally defined as the chain of bases, both paired and unpaired, comprising a closed loop, excluding its closing (‘downstream’) base pair. In the following we slightly depart from this definition and include the closing base pair as part of the (hence closed) loop. Our definition of the exterior loop, which lacks a closing base pair, is identical to the conventional one, namely, it includes all bases (paired and unpaired) along the shortest connected (covalently or H-bonded) path from the 5′- to the 3′-end.

Figure 1.
Three different representations of the mfold-predicted minimum free energy secondary structure of a random 200 nt ssRNA of uniform composition (25% A, C, G, U). (A) Conventional schematic, drawn with mfold, showing base-paired regions (duplexes) ...

5′–3′ Distance

As a simple intuitive measure of the 5′–3′ distance (in a given secondary structure of a given sequence) we use the total number of nucleotide links comprising the exterior loop, i.e.

$$D = n_{ss} + n_{ds} \qquad (1)$$

Here $n_{ss}$ is the number of covalent (phosphodiester) bonds (hereafter also referred to as ss links) in the exterior loop and $n_{ds}$ is the number of base-paired (H-bonded, ds) links in the exterior loop or, equivalently, the number of duplexes emanating from the exterior loop. As it is the total number of (ss and ds) links in the nucleotide chain constituting the exterior loop, we shall refer to $D$ as the ‘effective contour length’ of this loop. Expressing $n_{ss}$ in the form $n_{ss} = n - 1 - n_{ds}$, where $n$ is the total number of nucleotides in the exterior loop, and noting that $2n_{ds}$ is the total number of paired bases in the exterior loop, it follows from Equation (1) that $D + 1 - 2n_{ds}$ is the number of unpaired bases in this loop. Figure 2 illustrates an exterior loop where [formula image gkq642i36.jpg] whereas in Figure 1 [formula image gkq642i37.jpg]. It should be emphasized that the average physical distance between the 5′- and 3′-ends depends not only on $D$ but also on the specific sequence of the loop, as well as the number of duplexes branching from the loop. In fact the lengths of the covalent and H-bonded links are different (the latter are about three times larger). If all links were of equal length $a$, and their joints were fully flexible, then the physical 5′–3′ distance would be roughly $a D^{1/2}$, where we have neglected excluded volume effects because of the shortness of the exterior loop (12). It follows that small, $N$-independent $D$-values imply small, $N$-independent physical distances between the two chain ends.
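
The link-counting definition of the 5′–3′ distance can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes the secondary structure is supplied as a pseudoknot-free dot-bracket string (the text notation written by the Vienna package), and the function name `exterior_loop_links` is a hypothetical helper introduced here. The walk follows the shortest connected path from the 5′-end to the 3′-end, stepping across one covalent link at a time and jumping across one H-bonded link whenever a duplex branches off.

```python
def exterior_loop_links(dot_bracket):
    """Count the ss and ds links along the shortest 5'-to-3' path.

    Returns (n_ss, n_ds, D) with D = n_ss + n_ds. Assumes a balanced,
    pseudoknot-free dot-bracket string (an assumption of this sketch).
    """
    # Map every '(' to the index of its matching ')'.
    partner = {}
    stack = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            partner[stack.pop()] = i

    n_ss = n_ds = 0
    i, last = 0, len(dot_bracket) - 1
    while i < last:
        if dot_bracket[i] == "(":
            i = partner[i]   # cross the base pair: one H-bonded (ds) link
            n_ds += 1
        else:
            i += 1           # step to the next nucleotide: one covalent (ss) link
            n_ss += 1
    return n_ss, n_ds, n_ss + n_ds


n_ss, n_ds, D = exterior_loop_links("..((..)).(((...).(..))).")
print(n_ss, n_ds, D)
```

For this toy structure the exterior loop contains eight nucleotides, four of them unpaired, and two duplexes branch from it. Under the equal-link-length, freely jointed picture described above, the physical distance would then scale as the square root of `D`.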

Figure 2.
Detailed view of an exterior loop consisting of [formula image gkq642i33.jpg] covalent links and [formula image gkq642i34.jpg] H-bonded links of nucleotides. The effective contour length of the loop is [formula image gkq642i35.jpg].

Four simple observations will guide our calculation of the 5′–3′ distance:

  1. The MFE secondary structures of a given linear ssRNA molecule and that of the circular RNA obtained by linking the 5′- and 3′-ends of the linear chain are very similar, and their energies practically identical. This is because the presence or absence of a covalent (phosphodiester) bond between the terminal nucleotides does not significantly alter overall base pairing. Its small influence on the configurational free energy of the molecule enters only through the entropy difference between the open exterior loop in the linear RNA and the corresponding closed (interior) loop in the circular analog. Actually, for any secondary structure of the linear ssRNA, not only the one of minimum free energy, the corresponding circular structure has essentially the same energetic and structural characteristics. Conversely, any secondary structure of a linear RNA can be regarded as derived from ‘cutting’ a specific covalent bond in one of the interior loops of the corresponding circular RNA. We thus expect that secondary structure characteristics of long RNA molecules, such as the pairing fraction or average duplex length, are practically the same for the linear and circularized ‘isomers’. These conclusions have been confirmed by numerical analyses of a large number of linear and circular RNA sequences of different lengths and compositions, as reported below and in Supplementary Figure S1 and Supplementary Table S1.
  2. As noted in the Introduction, for long chains (say [formula image gkq642i44.jpg]) composed of comparable proportions of A, C, G and U (25 ± 5%), we find that $f \approx 60\%$ for randomly-permuted sequences and for most viral RNAs (Tables 1 and 2).
    Table 1.
    Composition dependence of the average percentage of bases paired (f), the average duplex length (k) and the average 5′–3′ distance (D), for different sets of random and yeast-derived sequences of length 3000 nt; each ...
    Table 2.
    Values of f, k and D for viral ssRNAs, determined with RNAsubopt
  3. For long chains, we also know that the average length of (i.e. number of base pairs in) a duplex, $k$, is independent of $N$ and rather insensitive to composition (for compositions involving 25 ± 5% of the four bases). For nearly all the sets of sequences examined in this study (randomly-permuted, viral and yeast-derived), $k$ is between 4 and 5 (Tables 1 and 2; Supplementary Table S1).
  4. As is well known, every secondary structure can be represented by a tree graph (26), as illustrated in Figure 1C.

Two simple and important results can easily be proved from the tree graph analogy. First, the number of vertices, $V$, and the number of bonds, $B$, of a circular RNA are related by the equality $V = B + 1$. This relation is also valid for linear RNAs provided the exterior loop is also represented by a vertex (possibly differently labeled, as in Fig. 1C). Second, on average (over all loops in any given structure), each loop (vertex) is connected to $\bar{z} = 2B/V = 2(1 - 1/V)$ duplexes (edges). For long ($N \gg 1$) sequences we also find $V \gg 1$ (see below), in which case we can safely set $\bar{z} = 2$, which (unless otherwise stated) will be the value used in our calculations. Note that the averaging here is over all loops in a given structure. The same holds, of course, after averaging over any number of structures and/or sequences. Note also that we always have $z \geq 1$, with $z = 1$ corresponding to a ‘hairpin’ loop, $z = 2$ to a ‘bubble’ or ‘bulge,’ and $z \geq 3$ to a ‘multi loop’.
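
The two tree-graph statements can be checked on a toy structure. The sketch below is an illustration under stated assumptions rather than the authors' procedure: the structure is a pseudoknot-free dot-bracket string for a linear chain, the exterior loop is counted as a vertex, and `loop_degrees` is a hypothetical helper name. It lists, for every loop, the number of duplexes attached to it; the number of loops then exceeds the number of duplexes by one, and the mean number of duplexes per loop equals twice the duplex count divided by the loop count.

```python
def loop_degrees(dot_bracket):
    """Return the number of duplexes attached to each loop (tree vertex).

    The first entry is the exterior loop; the others are interior loops,
    each counted together with its closing duplex.
    """
    n = len(dot_bracket)
    pair = [-1] * n
    stack = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pair[i], pair[j] = j, i

    def branches(start, stop):
        # Count duplexes that open between start and stop (inclusive),
        # skipping over everything enclosed by each of them.
        count, k = 0, start
        while k <= stop:
            if pair[k] > k:
                count += 1
                k = pair[k] + 1
            else:
                k += 1
        return count

    degrees = [branches(0, n - 1)]  # exterior loop has no closing duplex
    for i in range(n):
        j = pair[i]
        if j > i:
            stacked = i + 1 < j - 1 and pair[i + 1] == j - 1
            if not stacked:  # (i, j) is the innermost pair of a duplex
                degrees.append(1 + branches(i + 1, j - 1))
    return degrees


degrees = loop_degrees("..((..)).(((...).(..))).")
n_loops = len(degrees)
n_duplexes = sum(degrees) // 2   # every duplex joins exactly two loops
print(degrees, n_loops, n_duplexes, sum(degrees) / n_loops)
```

In this example three hairpin loops, one multi loop and the exterior loop are joined by four duplexes, so the mean is still well below 2; it approaches 2 only when the number of loops is large.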

Among the numerous possible secondary structures of long RNA sequences, there are often thousands whose free energies are just marginally higher ([formula image gkq642i73.jpg] or less) than that of the MFE configuration, and under equilibrium conditions all these structures are nearly equally likely. Consequently, any property of the molecule that depends on its secondary structures should be averaged over their full thermal (Boltzmann) distribution. Suppose that, using RNAsubopt or a similar program, we have stochastically sampled the thermal ensemble of structures corresponding to a certain circular ssRNA sequence of given length $N$ and base composition. As argued in observation 1 above, all the linear ssRNA molecules derived by cutting any covalent (ss) bond in any interior loop of any member of the above ensemble will fold into ensembles of structures that are practically identical both to each other and to the ensemble of the original circular molecule. The only difference is the appearance of an exterior loop, which now contains the 5′- and 3′-ends. For every given circular structure containing $V$ interior loops, this cutting procedure yields $S$ linear ssRNA sequences, where $S = \sum_{i=1}^{V} s_i$ is the total number of ss (covalent) bonds in all loops of the given structure, $s_i$ denoting the number of covalent bonds in loop $i$. Noting that the total number of nucleotides in the closed loop $i$, namely $u_i + p_i$, is equal to the total number of bonds in this loop ($s_i + z_i$), we find $s_i = u_i + p_i - z_i = u_i + z_i$, with $u_i$ and $p_i = 2z_i$ denoting the number of unpaired and H-bonded nucleotides in loop $i$, respectively, and $z_i$ the number of duplexes emerging from this loop. This yields $S = \sum_i u_i + \sum_i z_i = N_u + 2N_d$. We have used the fact that the first sum is the total number of unpaired nucleotides, $N_u = (1 - f)N$, and the fact that because every duplex is connected to two loops, the second sum is twice the total number ($N_d$) of duplexes in the structure. But $N_d$ can be expressed in the form $N_d = fN/(2k)$ so that $S = N(1 - f + f/k)$. Here, and in all subsequent analytical expressions involving $f$, its numerical value will be understood to be the fraction of bases in pairs, rather than the percentage. As before, $k$ denotes the average duplex length in the particular sequence considered. For $f = 0.6$ and $k = 4$ we find $S = 0.55N$.
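
The counting argument above reduces to simple arithmetic, sketched here in Python. The sketch reads the argument as follows (an interpretation, with hypothetical names): the loop-resident covalent bonds of a circular structure number the unpaired nucleotides plus twice the number of duplexes, and the number of duplexes is the number of base pairs divided by the average duplex length.

```python
def loop_bond_count(n_nt, f, k):
    """Covalent (ss) bonds lying in the loops of a circular structure.

    n_nt: chain length N; f: fraction (not percentage) of bases paired;
    k: average duplex length in base pairs.
    """
    n_unpaired = (1.0 - f) * n_nt       # first sum: all unpaired nucleotides
    n_duplexes = f * n_nt / (2.0 * k)   # f*N/2 base pairs, k pairs per duplex
    return n_unpaired + 2.0 * n_duplexes  # every duplex borders two loops


def duplex_bond_count(n_nt, f, k):
    """Covalent bonds buried inside duplexes: 2*(k - 1) per duplex."""
    return 2.0 * (k - 1.0) * f * n_nt / (2.0 * k)


N = 1000
print(loop_bond_count(N, 0.6, 4) / N)
```

As a cross-check, a circular chain of `N` nucleotides has `N` covalent bonds in total, so the loop bonds and the bonds inside duplexes should add up to `N` for any choice of `f` and `k`.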

In the next section we present numerical calculations of the average 5′–3′ distance $D$ for two types of ssRNA molecules, biological (yeast-derived and viral) and randomly-permuted sequences. The random sequences were included both for direct comparison to the biological sequences, and for general theoretical interest. In each case, a Boltzmann-weighted average $D$-value is determined for the thermal ensemble of structures associated with each sequence. We then report the mean of these ensemble-average $D$-values for each set of sequences.

For the random sequences a simple theoretical prediction of $D$ (showing good agreement with the numerical calculation) can be derived based on two reasonable approximations, as argued in Appendix 1. We show there that, for any given secondary structure of a very long ($N \gg 1$) ssRNA molecule, the 5′–3′ distance is given by

equation image
(2)

with $\langle s \rangle$ denoting the average number of ss covalent bonds per interior loop in the structure considered. In terms of the pairing fraction, $f$, and duplex length, $k$, of this structure we obtain $\langle s \rangle = 2[1 + k(1 - f)/f]$. For both the MFE structure and the canonical ensemble averages of secondary structures of random (but also viral) sequences containing roughly equal proportions of the four bases it is found that $f \approx 0.6$ and $k \approx 4$, yielding $\langle s \rangle \approx 7.3$, and hence $D \approx 12$. See also Table 1.

Numerical computations

RNA sequences

Randomly-permuted ssRNA sequences were generated with a Fisher–Yates shuffle driven by a Mersenne Twister random number generator (27) implemented in C++ (by R. Wagner, University of Michigan, available at: www-personal.umich.edu/~wagnerr/MersenneTwister.html). Viral ssRNA sequences were obtained from the National Center for Biotechnology Information Genome Database (www.ncbi.nlm.nih.gov). Yeast (Saccharomyces cerevisiae) genomic sequences were obtained from the Saccharomyces Genome Database (www.yeastgenome.org).
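
The permutation step can be illustrated with Python's standard library, whose `random` module is also built on a Mersenne Twister. This is a stand-in sketch, not the C++ code used in the study: the function name `random_rna`, the rounding rule for base counts, and the seed handling are all assumptions made for the example. The composition in the usage line is the ‘viral-like’ one quoted in the Results.

```python
import random


def random_rna(length, composition, seed=None):
    """Return a randomly permuted RNA string with a fixed base composition.

    composition maps each base to its fraction. Counts are rounded and any
    rounding remainder is assigned to the last base listed (an assumption
    of this sketch).
    """
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    bases = list(composition)
    counts = [round(length * composition[b]) for b in bases[:-1]]
    counts.append(length - sum(counts))
    seq = [b for b, c in zip(bases, counts) for _ in range(c)]
    # Fisher-Yates: swap each position with a uniformly chosen position
    # at or before it, working from the end of the list.
    for i in range(len(seq) - 1, 0, -1):
        j = rng.randrange(i + 1)
        seq[i], seq[j] = seq[j], seq[i]
    return "".join(seq)


viral_like = {"G": 0.24, "C": 0.22, "A": 0.26, "U": 0.28}
seq = random_rna(3000, viral_like, seed=1)
print(len(seq), seq.count("G"), seq.count("C"), seq.count("A"), seq.count("U"))
```

Because the shuffle only permutes a fixed multiset of bases, every generated sequence has exactly the requested composition, which is what allows composition and sequence order to be varied independently.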

Folding programs

Secondary structure predictions were made with two RNA folding programs: RNAsubopt, a program in the Vienna RNA Package, Version 1.7 (21,22), and mfold, Version 3.1 (23,24). These programs employ detailed, empirically based energy models to estimate the free energies of the non-pseudoknotted secondary structures that are formed by a specified ssRNA sequence. With RNAsubopt, it is possible to sample stochastically from the ensemble of secondary structures, with a sampling probability in proportion to each structure’s Boltzmann weight. Thus, sampling a sufficient number of structures (we use 1000) and averaging the D-values for this set gives a close approximation to the ensemble-average predicted value of the end-to-end distance for that sequence. In earlier work (16) we demonstrated that the average properties of subsets of 1000 structures are not significantly different from those of the complete ensemble of structures. More generally, for any property $X$, its RNAsubopt-predicted ensemble-average value is calculated as $\langle X \rangle = \frac{1}{1000}\sum_{i=1}^{1000} X_i$, where $X_i$ is its value in the $i$th member of the stochastically generated subset of the Boltzmann ensemble of secondary structures. In mfold, by contrast, an algorithm is used to generate a structurally diverse representation of the ensemble, rather than a thermally representative average. We configured mfold to generate the 1000 lowest-energy structures from such a set, measured D for each, and averaged them in proportion to their Boltzmann weights, to give an mfold-averaged D-value. For any property $X$, its mfold-predicted average value is $\langle X \rangle = \sum_i X_i e^{-\Delta G_i/k_B T} / \sum_i e^{-\Delta G_i/k_B T}$, with $\Delta G_i$ the free energy of the $i$th secondary structure relative to the MFE for that sequence.
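
The two averaging schemes described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the free energies are assumed to be given in kcal/mol relative to the MFE, and the temperature of 37 °C is an assumption rather than a value stated in the text.

```python
import math

# Assumed thermal energy: gas constant (kcal/mol/K) times 310.15 K (37 C).
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def sample_average(values):
    """Plain mean, appropriate when structures were drawn with Boltzmann
    probability (stochastic sampling), so each sample counts equally."""
    return sum(values) / len(values)


def boltzmann_weighted_average(values, rel_energies, rt=RT_KCAL_PER_MOL):
    """Mean weighted by exp(-dG/RT), appropriate for a list of low-energy
    structures that was not sampled with Boltzmann probability."""
    weights = [math.exp(-dg / rt) for dg in rel_energies]
    partition = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / partition


# Toy numbers (hypothetical): three structures with their D-values and
# free energies relative to the MFE structure.
d_values = [10, 14, 20]
rel_energies = [0.0, 0.5, 2.0]
d_sampled = sample_average(d_values)
d_weighted = boltzmann_weighted_average(d_values, rel_energies)
```

When all structures have the same free energy the two averages coincide; otherwise the weighted average is pulled toward the D-values of the lowest-energy structures.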

RESULTS

While there can be significant inter-taxon variation, the average composition of the viral RNAs in this study is ~24% G, 22% C, 26% A and 28% U (16). With this ‘viral-like’ composition, we generated 2000 random sequences of lengths 50, 100, 200 and 400; 1000 of lengths 800 and 1500; 500 of lengths 2000, 2500, 3000 and 4000; 300 of lengths 5000, 6000 and 7000; and 1000 of length 8000. These sequences were folded with RNAsubopt. Figure 3 shows the mean D and standard deviation for each length of RNA, and a regression line fitted to sequences of length 400 and greater. Except for the very short sequences, D is ~12, independent of sequence length; in addition, it is relatively insensitive to small changes in composition. That this D-value is identical to the estimate obtained above, through the theoretical calculation, is coincidental, because the latter is based on the somewhat approximate expression given in Equation (2) (the approximations are explained in Appendix 1). But it is nevertheless very striking, and highly significant, that the simple theory predicts a D-value that is of the correct magnitude and that is independent of length and sequence.
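
One simple way to produce random sequences with the ‘viral-like’ composition quoted above is sketched here. It is an assumption-laden illustration, not the procedure used in the study: each base is drawn independently, so the target composition holds only on average, and the names `VIRAL_LIKE` and `random_sequence` and the seed are hypothetical.

```python
import random

# Average viral composition quoted in the text (fractions sum to 1).
VIRAL_LIKE = {"G": 0.24, "C": 0.22, "A": 0.26, "U": 0.28}


def random_sequence(length, composition, rng):
    """Draw `length` bases independently with the given probabilities."""
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


rng = random.Random(42)  # hypothetical seed, for reproducibility
seq = random_sequence(3000, VIRAL_LIKE, rng)
u_fraction = seq.count("U") / len(seq)
```

If an exact composition is required, an alternative is to build a sequence with the exact base counts and then permute it with a shuffle.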

Figure 3.
Mean ensemble-averaged 5′–3′ distances, D, from Equation (1), for random and viral sequences. Standard deviations are shown with vertical bars. The small black points represent the 10 groups of viral sequences listed in Table 2 ...

Table 1 shows the results for 500 3000-nt ssRNAs of viral-like and uniform composition, as well as 500 ssRNAs that are the transcripts of consecutive 3000 bp sections on yeast (S. cerevisiae) chromosomes XI and XII. In these sets, the values of D, the fraction of bases in pairs and the average duplex length (averaged over the 500 sequences) were 12–14, ~60% and ~4, respectively. The last column in the table lists the values of D calculated according to Equation (2), and these results are seen to agree closely with those from the detailed numerical calculations (especially for the random sequences, as expected).

The viral taxa analyzed are listed in Table 2. All are non-enveloped ssRNA viruses and, except for the rod-shaped Tobamoviruses, have icosahedral capsids. The Leviviridae infect bacteria, the Astroviridae and Caliciviridae are animal viruses, and the remainder infect plants. The Bromoviridae are, in addition, tripartite: the genome consists of three ssRNAs, divided among three separate capsids. The number of sequences analyzed in each case corresponds to the number of species considered.

From Figure 3 it can be seen that the values and standard deviations of D for the viral RNAs are higher, but overlap those of the random sequences for all taxa except the Tymoviruses. This overlap can be understood from the fact that small D-values are an inherent consequence of base pairing; all non-pathological secondary structures with a sufficiently high percentage of bases in pairs will have a low D. The Tymoviruses show a relatively larger D (although still small relative to sequence length) because they have a significantly smaller percentage of bases in pairs.

We note that current RNA folding programs have been shown to be limited in their ability to correctly predict individual base pairs in long ssRNA sequences (28). Consistent with this, RNAsubopt and mfold (which use slightly different energy models to generate their ensembles of secondary structures, and different algorithms to sample from these ensembles), when given long sequences to fold, output structures that often show significant differences in the details of base pairing, as well as overall appearance. However, our simple theoretical model predicts that D depends only on the fraction of bases in pairs and the average duplex length, which we have previously found to be robust with respect to the details of the folding program used (16). Consequently, D should likewise be robust to the details of the folding program, and thus insensitive to low-level inaccuracies in specific predictions of base pairing. To test this, we compared predictions of D made using mfold and RNAsubopt. As expected, we found that the values do not differ significantly between the two folding programs, and can thus be considered broadly robust to the specific characteristics of the energy model used (Table 1).

There is currently no published experimental work that directly measures the 5′–3′ distance of large (10³–10⁴ nt) ssRNAs in their native state (i.e. not complexed with proteins). However, based on a combination of experimental and computational approaches, Filomatori et al. (4) have proposed a model for the secondary structure of the exterior loop of native dengue ssRNA. Their proposed loop has a D-value of 25, which is of the same magnitude as both the theoretical predictions in Table 1 and the numerical predictions in Table 2.
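
For a concrete picture of how a D-value can be read off a predicted structure, the sketch below counts the links along the exterior loop of a structure written in dot-bracket notation (the format output by the Vienna RNA Package). It assumes that each covalent bond between neighboring exterior nucleotides and each base pair closing an exterior duplex counts as one link; the function names are hypothetical and this is not the authors' code.

```python
def pair_table(structure):
    """Map each position to its pairing partner (-1 if unpaired)."""
    stack = []
    partner = [-1] * len(structure)
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return partner


def five_three_distance(structure):
    """Count links on the shortest exterior-loop path from 5' to 3' end."""
    partner = pair_table(structure)
    last = len(structure) - 1
    i = 0
    links = 0
    while i < last:
        if partner[i] > i:
            i = partner[i]  # cross the duplex through one H-bond link
        else:
            i += 1          # step along the backbone: one covalent link
        links += 1
    return links


print(five_three_distance("((..))"))            # ends paired to each other
print(five_three_distance("......"))            # fully unpaired chain
print(five_three_distance(".((..)).((...))."))  # two exterior duplexes
```

Averaging this count over a set of sampled structures for one sequence would give an estimate of the ensemble-averaged distance for that sequence.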

DISCUSSION

We have made two predictions in the current work, both of which can be tested experimentally. First, we have predicted with general theoretical arguments, and demonstrated with numerical computations involving the equilibrated secondary structures of a large number of different lengths and sequences, that the distance between ends of an ssRNA (or ssDNA) should be ~10–15 nt links. This corresponds to a 3D physical distance of a few nm, which is far smaller than the contour lengths of large ssRNA molecules. As mentioned earlier, a crude estimate of the 3D distance between ends may be obtained in terms of the root-mean-square (RMS) end-to-end distance associated with a flexible linear polymer defined by the string of covalent and H-bonded links shown in Figure 2. With an average link size of ~3/4 nm, and a D of 12, one obtains an RMS end-to-end distance of ~3 nm. This is approximately an order of magnitude less than the 37 nm average distance between nucleotides (radius of gyration) that has been measured by small-angle X-ray scattering for a 6400 nt viral ssRNA (29). Our estimate of 3 nm could be confirmed by fluorescence resonance energy transfer (FRET) measurements, or still more directly by cryo-EM imaging of large ssRNA molecules whose ends have been labeled by small gold particles (for example, 1 nm particles conjugated to oligonucleotides that are complementary to the 5′- and 3′-ends).
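
The ~3 nm figure can be reproduced with a short calculation, assuming that the crude estimate is the standard ideal-chain (freely jointed) result, in which the RMS end-to-end distance is the link size times the square root of the number of links. The symbols $R_{\mathrm{RMS}}$ and $b$ are introduced here for illustration.

```latex
R_{\mathrm{RMS}} \approx b\,\sqrt{D}
  = 0.75\ \mathrm{nm} \times \sqrt{12}
  \approx 0.75\ \mathrm{nm} \times 3.46
  \approx 2.6\ \mathrm{nm} \approx 3\ \mathrm{nm}
```

Under this assumption the estimate depends only weakly on D: the quoted range of 10–15 links changes the result by only a few tenths of a nanometer.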

Second, we have predicted that all the linearized ssRNAs obtained by making a single cut in a long circular ssRNA molecule should have secondary (and hence tertiary) structures that are essentially identical to those of the parent circular form. Accordingly, they should have the same size and shape. And because they necessarily have the same charge, they should show virtually indistinguishable band positions in native gels, even though the linear and circular forms can be easily distinguished in denaturing gels where the secondary structure needed to effectively circularize the linear molecule has been destroyed. Similarly, under native conditions, small-angle X-ray scattering experiments, cryo-EM, and measurements of diffusion coefficients/hydrodynamic radii should show no difference between the circular and linearized molecules. The only caveat here, as well as for the measurements of 5′–3′ distance described earlier, is that the secondary structures of the molecules be equilibrated, since this is explicitly assumed in the theoretical arguments leading to all of these predictions [for a critical discussion of the equilibration/renaturation (and the lack thereof) of ssRNA, see Uhlenbeck (30)].

FUNDING

US National Science Foundation (grant number CHE07-14411 to W.M.G.); the Israel Science Foundation (grant number 695/06 to A.B.-S.); the US–Israel Bi-National Science Foundation (grant number 2006-401 to A.B.-S.); The Netherlands Organization for Scientific Research, Rubicon grant (to P.P.); and the University of California, Los Angeles, a Dissertation Year Fellowship (to A.M.Y.). Funding for open access charge: Research grant of A.B.-S. (grant number ISF 695/06).

Conflict of interest statement. None declared.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGMENTS

We thank Li Tai Fang and Charles M. Knobler for many helpful discussions.

APPENDIX 1: DERIVATION OF D

Consider a particular secondary structure α of a given circular ssRNA molecule, containing N nucleotides and with a given base composition. Let [formula] denote the number of [formula]-loops (i.e. loops composed of [formula] unpaired nucleotides and [formula] duplexes) in this structure. Each [formula]-loop can be cut through any of its [formula] covalent bonds, yielding open exterior loops of [formula] links. The average effective contour length [formula] resulting from this cutting procedure is

equation image
(A1)

where the averages after the second equality are over all loops belonging to the particular structure. This follows from the fact that [formula] is the effective contour length of the exterior loop in a particular secondary structure, and [formula] is the statistical weight of [formula]-loops containing [formula] covalent bonds. [formula], with [formula] denoting the fraction of [formula]-loops in this structure and [formula] denoting the total number of loops in this structure. The ‘marginal’ probability distribution [formula] is the fraction of loops containing [formula] unpaired nucleotides, regardless of the number of duplexes connected to these loops. Similarly, [formula], etc. The sums over [formula] include all [formula] ([formula] corresponds to a bulge), yet we also note that, in the case of a hairpin ([formula]), energetic considerations generally imply [formula]. The sums over [formula] include all [formula].

For long random sequences a simplified expression for D [see Equation (2)], involving only [formula], can be derived based on two reasonable approximations. The first is to assume there are no correlations between the distributions of unpaired and paired nucleotides in loops, i.e. [formula], from which it follows that [formula]. Small deviations from this approximation may occur because, for hairpins, we generally have [formula], whereas for other loops we have [formula]. The second approximation serves to relate [formula] to [formula] and [formula] to [formula]. Here we assume that the distributions [formula] and [formula] of, respectively, the $(1-f_\alpha)N$ unpaired nucleotides and ([formula]) duplexes among the [formula] loops of structure α, are random. These distributions (analogous to those of indistinguishable balls randomly distributed among boxes) are determined by maximizing the (entropy) functional [formula] ([formula]), subject to the normalization [formula] and conservation [formula] constraints. In this way we find [formula], with a similar expression for [formula]. For concreteness and simplicity we set [formula] and [formula] for the minimum values of [formula] and [formula], thus obtaining [formula] and [formula]. Similarly, [formula], with the second equality following from the fact that, for all structures, [formula]. Equation (A1) now yields Equation (2) of the main text.

REFERENCES

1. Corver J, Lenches E, Smith K, Robison RA, Sando T, Strauss EG, Strauss JH. Fine mapping of a cis-acting sequence element in yellow fever virus RNA that is required for RNA replication and cyclization. J. Virol. 2003;77:2265–2270.
2. Hsu M-T, Parvin JD, Gupta S, Krystal M, Palese P. Genomic RNAs of influenza viruses are held in a circular conformation in virions and in infected cells by a terminal panhandle. Proc. Natl Acad. Sci. USA. 1987;84:8140–8144.
3. Frey TK, Gard DL, Strauss JH. Biophysical studies of circle formation by sindbis virus 49S RNA. J. Mol. Biol. 1979;132:1–18.
4. Filomatori CV, Lodeiro MF, Alvarez DE, Samsa MM, Pietrasanta L, Gamarnik AV. A 5′ RNA element promotes dengue virus RNA synthesis on a circular genome. Genes Dev. 2006;20:2238–2249.
5. Ooms M, Abbink TEM, Pham C, Berkhout B. Circularization of the HIV-1 RNA genome. Nucleic Acids Res. 2007;35:5253–5261.
6. Gallie DR. The cap and poly(A) tail function synergistically to regulate mRNA translational efficiency. Genes Dev. 1991;5:2108–2116.
7. Kneller ELP, Rakotondrafara AM, Miller WA. Cap-independent translation of plant viral RNAs. Virus Res. 2006;119:63–75.
8. Miller WA, White KA. Long-distance RNA-RNA interactions in plant virus gene expression and replication. Annu. Rev. Phytopathol. 2006;44:447–467.
9. Karetnikov A, Lehto K. Translation mechanisms involving long-distance base pairing interactions between the 5′ and 3′ non-translated regions and internal ribosomal entry are conserved for both genomic RNAs of blackcurrant reversion nepovirus. Virology. 2008;371:292–308.
10. Fabian MR, White KA. 5′–3′ RNA-RNA interaction facilitates cap- and poly(A) tail-independent translation of tomato bushy stunt virus mRNA: a potential common mechanism for Tombusviridae. J. Biol. Chem. 2004;279:28862–28872.
11. Cloutier TE, Widom J. DNA twisting flexibility and the formation of sharply looped protein-DNA complexes. Proc. Natl Acad. Sci. USA. 2005;102:3645–3650.
12. Grosberg AY, Khokhlov AR. Statistical Physics of Macromolecules. New York: AIP Press; 1994.
13. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9:133–148.
14. Zuker M, Sankoff D. RNA secondary structures and their prediction. Bull. Math. Biol. 1984;46:591–621.
15. Fontana W, Konings DAM, Stadler PF, Schuster P. Statistics of RNA secondary structures. Biopolymers. 1993;33:1389–1404.
16. Yoffe AM, Prinsen P, Gopal A, Knobler CM, Gelbart WM, Ben-Shaul A. Predicting the sizes of large RNA molecules. Proc. Natl Acad. Sci. USA. 2008;105:16153–16158.
17. Hofacker IL, Schuster P, Stadler PF. Combinatorics of RNA secondary structures. Discr. Appl. Math. 1998;88:207–237.
18. Clote P, Kranakis E, Krizanc D, Stacho L. Asymptotic expected number of base pairs in optimal secondary structure for random RNA using the Nussinov–Jacobson energy model. Discr. Appl. Math. 2007;155:759–787.
19. Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc. Natl Acad. Sci. USA. 1980;77:6309–6313.
20. de Gennes PG. Statistics of branching and hairpin helices for the dAT copolymer. Biopolymers. 1968;6:715–729.
21. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 1994;125:167–188.
22. Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49:145–165.
23. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415.
24. Mathews DH, Sabina J, Zuker M, Turner DH. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940.
25. Wiese KC, Glen E, Vasudevan A. JViz.Rna—a Java tool for RNA secondary structure visualization. IEEE T. Nanobiosci. 2005;4:212–218.
26. Waterman MS. Secondary structure of single-stranded nucleic acids. Adv. Math. Suppl. Stud. 1978;1:167–212.
27. Matsumoto M, Nishimura T. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM T Model. Comput. Sci. 1998;8:3–30.
28. Mathews DH. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA. 2004;10:1178–1190.
29. Muroga Y, Sano Y, Inoue H, Suzuki K, Miyata T, Hiyoshi T, Yokota K, Watanabe Y, Liu X, Ichikawa S, et al. Small angle X-ray scattering studies on local structure of tobacco mosaic virus RNA in solution. Biophys. Chem. 2000;83:197–209.
30. Uhlenbeck OC. Keeping RNA happy. RNA. 1995;1:4–6.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press