We show on general theoretical grounds that the two ends of single-stranded (ss) RNA molecules (consisting of roughly equal proportions of A, C, G and U) are necessarily close together, largely independent of their length and sequence. This is demonstrated to be a direct consequence of two generic properties of the equilibrium secondary structures, namely that the average proportion of bases in pairs is ~60% and that the average duplex length is ~4. Based on mfold and Vienna computations on large numbers of ssRNAs of various lengths (1000–10000nt) and sequences (both random and biological), we find that the 5′–3′ distance—defined as the sum of H-bond and covalent (ss) links separating the ends of the RNA chain—is small, averaging 15–20 for each set of viral sequences tested. For random sequences this distance is ~12, consistent with the theory. We discuss the relevance of these results to evolved sequence complementarity and specific protein binding effects that are known to be important for keeping the two ends of viral and messenger RNAs in close proximity. Finally we speculate on how our conclusions imply indistinguishability in size and shape of equilibrated forms of linear and covalently circularized ssRNA molecules.
There are many situations in which it is biologically important for the two ends of a large RNA molecule to be close to each other. In animal viruses with single-stranded (ss) RNA genomes, for example, efficient replication of the genome has been shown to depend on its effective ‘circularization’. More explicitly, complementary sequences have been identified at or near the 5′- and 3′-ends that are responsible for forming ‘panhandles’ that keep the two ends close together. These panhandles are duplexes that are 21bp in the case of yellow fever virus (1), and 15bp in the case of influenza A (2), thereby according them unusual robustness. Another example where RNA genome circularization of this kind has been implicated in RNA replication is sindbis virus; here an 18bp 5′–3′ panhandle has been shown to survive denaturing conditions sufficient to eliminate much of the remaining secondary structure, leaving the genome with a circular appearance in electron micrographs (3). In dengue, also (like yellow fever, influenza A and sindbis) a positive-sense RNA virus, minus-strand synthesis involves long-distance 5′–3′ base pairing that facilitates the transfer of the RNA-dependent RNA polymerase from its binding site at the 5′-end to the initiation site at the 3′-end (4). Similarly, circularization of HIV-1 has been shown to arise from base pairing between the 5′- and 3′-ends of the RNA genome (5); these interactions are found to occur as well in different HIV-1 subtypes with large sequence variation, suggesting they share an evolutionary basis.
It has also long been known that effective circularization of messenger RNA molecules is important for efficient translation. The 5′- ‘capping’ and 3′-polyadenylation of mRNAs—through a variety of specific protein-binding events—result in the association of the two ends of the molecules and subsequent formation of translation initiation complexes (6). In eukaryotes, for example, the 3′-poly(A) ‘tail’ interacts with the poly(A)-binding protein, the 5′-G-cap binds a eukaryotic initiation factor, and these two bound proteins—with the full length of mRNA intervening—simultaneously bind a ‘bridging’ protein. This effective circularization of the molecule results in recruitment of the 40S ribosomal subunit (via binding of still another protein) and initiation of translation.
Because circularization of mRNA is so important for its translation, mechanisms that co-localize the ends have evolved even in cases where the molecules are not capped or polyadenylated. Plant viruses, for example, often lack both of these special sequences and yet are translated efficiently (7,8). The effective circularization is enhanced by direct base pairing between sub-sequences in the untranslated regions (UTRs) at the 5′- and 3′-ends; the UTRs functionally replace the G-cap and poly(A) tail. Further, the RNAs of many positive-sense (mRNA) viruses have internal ribosome entry sites (IRESs) at their 5′-ends, i.e. subsequences that recruit ribosomes and initiate translation (9,10).
In all of the above examples—involving both direct interaction between 5′- and 3′-ends or interaction mediated by binding proteins—particular, evolved, subsequences are involved in effective circularization. But in all of these scenarios, an even more fundamental requirement is that the two ends of the fluctuating molecule must spend enough time near each other in order for there to be a high probability for the special elements—RNA subsequences or binding proteins—to find one another. More explicitly, we will argue here that effective circularization of large RNA molecules is achieved through generic properties of secondary structure that are essentially independent of sequence. The specific evolved subsequences mentioned above are not needed so much for circularization as for facilitating the binding of particular proteins—e.g. RNA replicases and ribosome initiation factors—that are important for biological function of the circularized RNA.
Consider the analogous situation of double-stranded (ds) DNA with ‘sticky’ ends arising from complementary ss overhangs (generated, say, by a restriction enzyme). Here the probability of the two ends being covalently joined by a ligase is directly determined, and ultimately limited, by the likelihood that they are close enough to each other to bind, i.e. that the double helix can twist and bend enough for its two ends to come together (11). This classic problem is informed by the well-known statistical mechanical result giving the likelihood of the ends of a linear, semiflexible polymer being within a monomer distance of one another. For sufficiently long molecules this probability is of order $(L/\ell_p)^{-3/2}$, where $L$ and $\ell_p$ are the contour and persistence lengths, respectively, of the linear polymer; the contour length is the number of monomers times the average inter-monomer distance, and the persistence length is the distance along the chain contour beyond which the polymer can bend almost freely (12). Thus, the circularization probability of long DNA is small because $L/\ell_p$ is large, i.e. the molecule is long compared to its persistence length (50 nm for DNA): maximization of configurational entropy requires that the ends be far apart. The small probability of finding them close, decreasing as $L^{-3/2}$, reflects directly the fact that the root-mean-square distance between the ends of the molecule increases as $L^{1/2}$.
To understand the basis for effective circularization of ssRNA, then, it is natural to ask: is there, in analogy with dsDNA, a generic result for the probability of finding the two ends of an RNA molecule close to one another, and how different is it from that for a linear polymer? In this article we argue that there is indeed a universal distribution of end-to-end distances in large RNA molecules, and furthermore that it is essentially independent of overall sequence and length. We show in particular that the distance between ends is necessarily small, because of generic features of the secondary structure, notably that the percentage (f) of paired nucleotides (nt) is ~60% and that the average duplex length (k) is ~4. Using an early variant of the RNA folding algorithm developed by Zuker et al. (13,14), Fontana et al. (15) have calculated various characteristics of the minimum free energy (MFE) structure corresponding to several different types of short (20–100 nt) sequences. Averaging over many sequences of the same length (number of nucleotides, N) and base composition, they found that f and k approach constant values with increasing N. They also calculated a property (the number of unpaired bases in ‘joints’ and ‘free ends’) that is closely related to our definition of the 5′–3′ distance (see next section), finding that for the short chains analyzed this number increases, yet with a gradually decreasing slope, as N increases. The constancy of f and k has been confirmed for a wide range of biological (viral and yeast) ssRNA sequences (16) by application of the mfold and Vienna codes for predicting thermally accessible secondary structures.
For certain models of polynucleotide chains, the N-independence of f and k has been proven analytically, using a variety of powerful theoretical tools. Hofacker et al. (17), applying an elegant graph-theoretic approach, derived exact results for these properties (see their Table 3) and various other secondary structure attributes of RNA-like heteropolymers. Their results apply to an idealized ensemble where all possible secondary structures have equal statistical weight, resulting in low values of f and k. More recently, Clote et al. (18), using the Nussinov–Jacobson (‘maximum base pairing’) model (19), have shown that, for an ssRNA chain with Watson–Crick pairing rules, f approaches a constant value slightly exceeding 90% for large N (>1000). Earlier, de Gennes had noted (20) that, for a random sequence of two complementary nucleotides, the distance between chain ends remains finite even as N approaches infinity. Based on this notion he also concluded that ‘…many properties of a large, open, strand are not very different from those of a cyclic strand of equal molecular length’ (20). We elaborate on this idea in the next section.
Our goal in the present work is to emphasize the generality of the proximity of the 5′- and 3′-ends of large RNA molecules of arbitrary length and sequence. Based on the general findings noted above for large ssRNA chains, we derive a simple expression for the 5′–3′ distance that can be evaluated numerically for sequences of given length and base composition. We also calculate this distance using the RNAsubopt (21,22) and mfold (23,24) folding algorithms. A further consequence of our analyses is that the secondary (and hence tertiary) structures of linear and covalently-circularized RNA molecules are practically identical. These conclusions are tested against several systematic calculations of secondary structures for specific linear and circular sequences, both random and viral.
Figure 1A displays the MFE secondary structure of a rather short (200nt) random-sequence ssRNA molecule, composed of equal numbers of A, C, G and U, as predicted by the mfold algorithm (23,24). The duplexes are represented in the usual way by straight ‘ladders’ and the loops by circles of different sizes. The same secondary structure is visualized slightly less schematically in Figure 1B, with more realistic scaling of duplex dimensions, using the jViz.Rna drawing program (25). This latter representation illustrates that the dangling ss segments in the ‘exterior loop’—the one including the 5′- and 3′-ends—are independent flexible chains. In Figure 1C the secondary structure is mapped into a tree graph, where each edge (bond) represents a duplex and the vertices represent the loops (15,17,26); the interior loops are denoted by solid circles, and the exterior loop by an open circle. The term ‘interior loop’ is conventionally defined as the chain of bases, both paired and unpaired, comprising a closed loop, excluding its closing (‘downstream’) base pair. In the following we slightly depart from this definition and include the closing base pair as part of the (hence closed) loop. Our definition of the exterior loop, which lacks a closing base pair, is identical to the conventional one, namely, it includes all bases (paired and unpaired) along the shortest connected (covalently or H-bonded) path from the 5′- to the 3′-end.
As a simple intuitive measure of the 5′–3′ distance (in a given secondary structure of a given sequence) we use the total number of nucleotide links comprising the exterior loop, i.e.
$$D = l + d \qquad (1)$$
Here $l$ is the number of covalent (phosphodiester) bonds (hereafter also referred to as ss links) in the exterior loop and $d$ is the number of base-paired (H-bonded, ds) links in the exterior loop or, equivalently, the number of duplexes emanating from the exterior loop. As it is the total number of (ss and ds) links in the nucleotide chain constituting the exterior loop, we shall refer to D as the ‘effective contour length’ of this loop. Expressing D in the form $D = n - 1$, where $n$ is the total number of nucleotides in the exterior loop, and noting that $2d$ is the total number of paired bases in the exterior loop, it follows from Equation (1) that $n - 2d = l - d + 1$ is the number of unpaired bases in this loop. Figure 2 illustrates an exterior loop of this kind; the structure in Figure 1 provides a second example. It should be emphasized that the average physical distance between the 5′- and 3′-ends depends not only on D but also on the specific sequence of the loop, as well as the number of duplexes branching from the loop. In fact the lengths of the covalent and H-bonded links are different (the latter are about three times larger). If all links were of equal length $b$, and their joints were fully flexible, then the physical 5′–3′ distance would be roughly $bD^{1/2}$, where we have neglected excluded volume effects because of the shortness of the exterior loop (12). It follows that small, N-independent D-values imply small, N-independent physical distances between the two chain ends.
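As a minimal sketch of how this link count can be read off a predicted structure, the function below computes the 5′–3′ distance (covalent plus H-bonded links in the exterior loop) from a secondary structure written in dot-bracket notation. The dot-bracket input format, the function name `five_three_distance` and the example string are assumptions made for illustration; they are not taken from the original analysis.

```python
def five_three_distance(structure: str) -> int:
    """Count the links (covalent + H-bonded) along the exterior loop.

    `structure` is a pseudoknot-free dot-bracket string: '.' is an unpaired
    base, '(' and ')' are the two partners of a base pair.
    """
    depth = 0          # nesting level; 0 means we are in the exterior loop
    unpaired = 0       # unpaired bases in the exterior loop
    duplexes = 0       # duplexes emanating from the exterior loop
    for ch in structure:
        if ch == "(":
            if depth == 0:
                duplexes += 1      # a new duplex branches off the exterior loop
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
        elif ch == ".":
            if depth == 0:
                unpaired += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced structure")
    # Nucleotides on the exterior path: unpaired bases plus the two bases of
    # each closing pair; a path of n nucleotides has n - 1 links.
    nucleotides = unpaired + 2 * duplexes
    return max(nucleotides - 1, 0)


print(five_three_distance("..((...))..((..))."))  # 8
```

In the example, the exterior loop holds five unpaired bases and two duplexes, hence nine nucleotides and eight links (six covalent, two H-bonded).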
Four simple observations will guide our calculation of the 5′–3′ distance:
Two simple and important results can easily be proved from the tree graph analogy. First, the number of vertices, $V$, and the number of bonds, $B$, of a circular RNA are related by the equality $V = B + 1$. This relation is also valid for linear RNAs provided the exterior loop is also represented by a vertex (possibly differently labeled, as in Fig. 1C). Second, on average (over all loops in any given structure), each loop (vertex) is connected to $\langle d\rangle = 2B/V = 2(1 - 1/V)$ duplexes (edges), because every duplex joins exactly two loops. For long ($N \gg 1$) sequences we also find $V \gg 1$ (see below), in which case we can safely set $\langle d\rangle = 2$, which (unless otherwise stated) will be the value used in our calculations. The same value holds, of course, after averaging over any number of structures and/or sequences. Note also that we always have $d \ge 1$, with $d = 1$ corresponding to a ‘hairpin’ loop, $d = 2$ to a ‘bubble’ or ‘bulge,’ and $d \ge 3$ to a ‘multi loop’.
Among the numerous possible secondary structures of long RNA sequences, there are often thousands whose free energies are only marginally higher than that of the MFE configuration, and under equilibrium conditions all these structures are nearly equally likely. Consequently, any property of the molecule that depends on its secondary structures should be averaged over their full thermal (Boltzmann) distribution. Suppose that, using RNAsubopt or a similar program, we have stochastically sampled the thermal ensemble of structures corresponding to a certain circular ssRNA sequence of given length and base composition. As argued in (i), above, all the linear ssRNA molecules derived by cutting any covalent (ss) bond in any interior loop of any member of the above ensemble will fold into ensembles of structures that are practically identical both to each other, and to the ensemble of the original circular molecule. The only difference is the appearance of an exterior loop, which now contains the 5′- and 3′-ends. For every given circular structure containing $V$ interior loops, this cutting procedure yields $L = \sum_{i=1}^{V} l_i$ linear ssRNA sequences, where $L$ is the total number of ss (covalent) bonds in all loops of the given structure, $l_i$ denoting the number of covalent bonds in loop $i$. Noting that the total number of nucleotides in the closed loop $i$, namely $u_i + 2d_i$, is equal to the total number of bonds in this loop ($l_i + d_i$), we find $l_i = u_i + d_i$, with $u_i$ and $2d_i$ denoting the number of unpaired and H-bonded nucleotides in loop $i$, respectively, and $d_i$ the number of duplexes emerging from this loop. This yields $L = \sum_i u_i + \sum_i d_i = (1 - f)N + 2M$. We have used the fact that the first sum is the total number of unpaired nucleotides, $(1 - f)N$, and the fact that, because every duplex is connected to two loops, the second sum is twice the total number ($M$) of duplexes in the structure. But $M$ can be expressed in the form $M = fN/(2k)$, so that $L = (1 - f + f/k)N$. Here, and in all subsequent analytical expressions involving f, its numerical value will be understood to be the fraction of bases in pairs, rather than the percentage.
As before, k denotes the average duplex length in the particular sequence considered. For $f = 0.6$ and $k = 4$ we find $L = 0.55N$.
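The tree-graph relations used above can be checked on any pseudoknot-free structure. The sketch below, which assumes dot-bracket input and uses the hypothetical helper name `loop_degrees`, lists the number of duplexes attached to each loop (exterior loop first). For a tree, the number of loops should exceed the number of duplexes by one, and the degrees should sum to twice the number of duplexes.

```python
def loop_degrees(structure: str) -> list[int]:
    """Return the number of duplexes attached to each loop of a structure.

    The first entry is the exterior loop; a duplex is a maximal run of
    stacked base pairs. Input is a pseudoknot-free dot-bracket string.
    """
    pair = [-1] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pair[i], pair[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")

    def branches(start: int, end: int) -> int:
        # Count duplexes that open directly inside positions start..end-1.
        count, i = 0, start
        while i < end:
            if pair[i] > i:
                count += 1
                i = pair[i] + 1    # jump over everything this duplex encloses
            else:
                i += 1
        return count

    degrees = [branches(0, len(structure))]          # exterior loop
    for i, j in enumerate(pair):
        if j > i:
            stacked = i + 1 < j - 1 and pair[i + 1] == j - 1
            if not stacked:                          # innermost pair of a duplex
                degrees.append(1 + branches(i + 1, j))
    return degrees


degrees = loop_degrees("((..((..))..((..))..))")
n_loops = len(degrees)
n_duplexes = n_loops - 1
print(degrees, sum(degrees) == 2 * n_duplexes)
```

Because each duplex closes exactly one loop at its inner end, counting innermost pairs gives one loop per duplex, and the exterior loop supplies the extra vertex.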
In the next section we present numerical calculations of the average 5′–3′ distance for two types of ssRNA molecules, biological (yeast-derived and viral) and randomly-permuted sequences. The random sequences were included both for direct comparison to the biological sequences, and for general theoretical interest. In each case, a Boltzmann-weighted average D-value is determined for the thermal ensemble of structures associated with each sequence. We then report the mean of these ensemble-average D-values for each set of sequences.
For the random sequences a simple theoretical prediction of D (showing good agreement with the numerical calculation) can be derived based on two reasonable approximations, as argued in Appendix 1. We show there that, for any given secondary structure of a very long ($N \gg 1$) ssRNA molecule, the 5′–3′ distance is given by a simple expression, Equation (2), that involves only $\langle l\rangle$, the average number of ss covalent bonds per interior loop in the structure considered. In terms of the pairing fraction, f, and duplex length, k, of this structure we obtain $\langle l\rangle = 2[1 + k(1 - f)/f]$. For both the MFE structure and the canonical ensemble averages of secondary structures of random (but also viral) sequences containing roughly equal proportions of the four bases it is found that $f \approx 0.6$ and $k \approx 4$, yielding $\langle l\rangle \approx 7.3$, and hence $D \approx 12$. See also Table 1.
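The arithmetic behind these estimates is easy to reproduce. The sketch below assumes, as in the counting argument of the text, that a structure with pairing fraction f and average duplex length k (the symbol k is a label chosen here) has (1 − f)N unpaired bases and fN/(2k) duplexes, and that for long chains the number of loops is approximately equal to the number of duplexes. The function names are hypothetical.

```python
def loop_bond_fraction(f: float, k: float) -> float:
    """Fraction of nucleotides whose covalent bond lies in a loop.

    Total loop bonds = unpaired bases + 2 * duplexes
                     = (1 - f) * N + 2 * f * N / (2 * k).
    """
    return (1.0 - f) + f / k


def mean_loop_bonds(f: float, k: float) -> float:
    """Average number of covalent bonds per loop for a long chain.

    Divides the total loop bonds by the number of loops, taken to be
    approximately the number of duplexes, f * N / (2 * k).
    """
    return 2.0 * (1.0 + k * (1.0 - f) / f)


print(round(loop_bond_fraction(0.6, 4), 2))  # 0.55
print(round(mean_loop_bonds(0.6, 4), 2))     # 7.33
```

With f = 0.6 and k = 4, roughly 55% of the backbone bonds sit in loops, and each loop contains about seven covalent bonds on average.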
Randomly-permuted ssRNA sequences were generated with a Fisher–Yates shuffle driven by a Mersenne Twister random number generator (27) implemented in C++ (by R. Wagner, University of Michigan, available at: www-personal.umich.edu/~wagnerr/MersenneTwister.html). Viral ssRNA sequences were obtained from the National Center for Biotechnology Information Genome Database (www.ncbi.nlm.nih.gov). Yeast (Saccharomyces cerevisiae) genomic sequences were obtained from the Saccharomyces Genome Database (www.yeastgenome.org).
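For illustration only, a Fisher–Yates shuffle of an RNA sequence can be sketched in Python as follows. This is not the C++ implementation used for the calculations; it relies on Python's standard `random` module, whose default generator is also a Mersenne Twister. The function name, seed and example sequence are assumptions.

```python
import random


def fisher_yates_shuffle(sequence: str, rng: random.Random) -> str:
    """Return a uniformly random permutation of `sequence`.

    The base composition is preserved exactly; only the order changes.
    """
    bases = list(sequence)
    # Walk from the last position down, swapping each element with a
    # uniformly chosen element at or before it.
    for i in range(len(bases) - 1, 0, -1):
        j = rng.randrange(i + 1)       # 0 <= j <= i
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)


rng = random.Random(2024)              # fixed seed so the run is repeatable
original = "GGGGCCCCAAAAUUUU"
shuffled = fisher_yates_shuffle(original, rng)
print(shuffled, sorted(shuffled) == sorted(original))
```

Seeding the generator explicitly makes each permuted set reproducible, which is convenient when the same random sequences must be folded by more than one program.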
Secondary structure predictions were made with two RNA folding programs, RNAsubopt, a program in the Vienna RNA Package, Version 1.7 (21,22), and mfold, Version 3.1 (23,24). These programs employ detailed empirically-based energy models to estimate the free energies of the non-pseudoknotted secondary structures that are formed by a specified ssRNA sequence. With RNAsubopt, it is possible to sample stochastically from the ensemble of secondary structures, with a sampling probability in proportion to each structure’s Boltzmann weight. Thus, sampling a sufficient number of structures (we use 1000), and averaging the D-values for this set, gives a close approximation to the ensemble-average predicted value of the end-to-end distance for that sequence. In earlier work (16) we demonstrated that the average properties of subsets of 1000 structures are not significantly different from those of the complete ensemble of structures. More generally, for any property $X$, its RNAsubopt-predicted ensemble-average value is calculated as $\langle X\rangle = \frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is its value in the $i$th member of the stochastically-generated subset (of size $n$) of the Boltzmann ensemble of secondary structures. In mfold, by contrast, an algorithm is used to generate a structurally diverse representation of the ensemble, rather than a thermally-representative average. We configured mfold to generate the 1000 lowest-energy structures from such a set, measured D for each, and averaged them in proportion to their Boltzmann weights, to give an mfold-averaged D-value. For any property $X$, its mfold-predicted average value is $\langle X\rangle = \sum_i X_i e^{-\Delta G_i/RT} / \sum_i e^{-\Delta G_i/RT}$, with $\Delta G_i$ the free energy of the $i$th secondary structure relative to the MFE for that sequence.
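The two averaging schemes differ only in where the Boltzmann weight enters. The sketch below contrasts them. It assumes free energies in kcal/mol and a folding temperature of 37 °C; the function names and example numbers are hypothetical, and the code does not call either folding program.

```python
import math

# Gas constant in kcal/(mol K) times an assumed temperature of 310.15 K (37 C).
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def sample_mean(values: list[float]) -> float:
    """Plain mean, appropriate when structures were drawn with probability
    proportional to their Boltzmann weights (stochastic sampling)."""
    return sum(values) / len(values)


def boltzmann_average(values: list[float],
                      rel_free_energies: list[float],
                      rt: float = RT_KCAL_PER_MOL) -> float:
    """Weighted mean for an enumerated (not thermally sampled) set of
    structures; `rel_free_energies` are energies relative to the MFE."""
    weights = [math.exp(-dg / rt) for dg in rel_free_energies]
    partition = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / partition


# Hypothetical D-values for three structures and their energies above the MFE.
d_values = [10.0, 14.0, 30.0]
energies = [0.0, 0.5, 3.0]
print(sample_mean(d_values), boltzmann_average(d_values, energies))
```

In the weighted version, a structure a few kcal/mol above the MFE contributes very little, so an outlying D-value there barely shifts the average.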
While there can be significant inter-taxon variation, the average composition of the viral RNAs in this study is ~24% G, 22% C, 26% A and 28% U (16). With this ‘viral-like’ composition, we generated 2000 random sequences of lengths 50, 100, 200 and 400; 1000 of lengths 800 and 1500; 500 of lengths 2000, 2500, 3000 and 4000; 300 of lengths 5000, 6000 and 7000; and 1000 of length 8000. These sequences were folded with RNAsubopt. Figure 3 shows the mean D and its standard deviation for each length of RNA, and a regression line fitted to sequences of length 400 and greater. Except for the very short sequences, D is ~12, independent of sequence length; in addition, it is relatively insensitive to small changes in composition. That this D-value is identical to the estimate obtained above, through the theoretical calculation, is coincidental, because the latter is based on the somewhat approximate expression given in Equation (2) (the approximations are explained in Appendix 1). But it is nevertheless very striking, and highly significant, that the simple theory predicts a D-value that is of the correct magnitude and that is independent of length and sequence.
Table 1 shows the results for 500 3000-nt ssRNAs of viral-like and of uniform composition, as well as 500 ssRNAs that are the transcripts of consecutive 3000 bp sections on yeast (S. cerevisiae) chromosomes XI and XII. In these sets, the values of D, f and k (averaged over the 500 sequences) were 12–14, ~60% and ~4, respectively. The last column in the table lists the values of D calculated according to Equation (2), and these results are seen to agree closely with those from the detailed numerical calculations (especially for the random sequences, as expected).
The viral taxa analyzed are listed in Table 2. All are non-enveloped ssRNA viruses and, except for the rod-shaped Tobamoviruses, have icosahedral capsids. The Leviviridae infect bacteria, the Astroviridae and Caliciviridae are animal viruses, and the remainder infect plants. The Bromoviridae are, in addition, tripartite: the genome consists of three ssRNAs, divided among three separate capsids. The number of sequences analyzed in each case corresponds to the number of species considered.
From Figure 3 it can be seen that the values and standard deviations of D for the viral RNAs are higher, but overlap those of the random sequences for all taxa except the Tymoviruses. The overlap can be understood from the fact that small D-values are an inherent consequence of base pairing; all non-pathological secondary structures with a sufficiently high percentage of bases in pairs, f, will have a low D. The Tymoviruses show a relatively larger D (although still small relative to sequence length) because they have a significantly smaller f.
We note that current RNA folding programs have been shown to be limited in their ability to correctly predict individual base pairs in long ssRNA sequences (28). Consistent with this, RNAsubopt and mfold (which use slightly different energy models to generate their ensembles of secondary structures, and different algorithms to sample from these ensembles), when given long sequences to fold, output structures that often show significant differences in the details of base pairing, as well as overall appearance. However, our simple theoretical model predicts that D depends only on the values of f and k, which we have previously found to be robust with respect to the details of the folding program used (16). Consequently, D should likewise be robust to the details of the folding program, and thus insensitive to low-level inaccuracies in specific predictions of base pairing. To test this, we compared predictions of D made using mfold and RNAsubopt. As expected, we found that the D-values do not differ significantly between the two folding programs, and can thus be considered broadly robust to the specific characteristics of the energy model used (Table 1).
There is currently no published experimental work that directly measures the 5′–3′ distance of large ($10^3$–$10^4$ nt) ssRNAs in their native state (i.e. not complexed with proteins). However, based on a combination of experimental and computational approaches, Filomatori et al. (4) have proposed a model for the secondary structure of the exterior loop of native dengue ssRNA. Their proposed loop has a D-value of 25, which is of the same magnitude as both the theoretical predictions in Table 1 and the numerical predictions in Table 2.
We have made two predictions in the current work, both of which can be tested experimentally. First, we have predicted with general theoretical arguments, and demonstrated with numerical computations involving the equilibrated secondary structures of a large number of different lengths and sequences, that the distance between ends of an ssRNA (or ssDNA) should be ~10–15 nt links. This corresponds to a 3D physical distance of a few nm, which is far smaller than the contour lengths of large ssRNA molecules. As mentioned earlier, a crude estimate of the 3D distance between ends may be obtained in terms of the root-mean-square (RMS) end-to-end distance ($bD^{1/2}$) associated with a flexible linear polymer defined by the string of covalent and H-bonded links shown in Figure 2. With an average link size, $b$, of ~3/4 nm, and a D of 12, one obtains an RMS end-to-end distance of ~3 nm. This is approximately an order of magnitude less than the 37 nm average distance between nucleotides (radius of gyration) that has been measured by small-angle X-ray scattering for a 6400 nt viral ssRNA (29). Our estimate of 3 nm could be confirmed by fluorescence resonance energy transfer (FRET) measurements, or still more directly by cryo-EM imaging of large ssRNA molecules whose ends have been labeled by small gold particles (for example, 1 nm particles conjugated to oligonucleotides that are complementary to the 5′- and 3′-ends).
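The estimate quoted above can be written out explicitly. The lines below are a worked version of that arithmetic under the same freely-jointed-chain assumption, with the link length written as b (0.75 nm) and the effective contour length as D (12 links); the symbol R_rms is introduced here only for convenience.

```latex
R_{\mathrm{rms}} \approx b\,D^{1/2}
  = 0.75\ \mathrm{nm} \times \sqrt{12}
  \approx 2.6\ \mathrm{nm} \;(\text{i.e. } \sim 3\ \mathrm{nm}),
\qquad
\frac{37\ \mathrm{nm}}{2.6\ \mathrm{nm}} \approx 14 .
```

The ratio of about 14 is what underlies the statement that the end-to-end distance is roughly an order of magnitude smaller than the measured radius of gyration.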
Second, we have predicted that all the linearized ssRNAs obtained by making a single cut in a long circular ssRNA molecule should have secondary (and hence tertiary) structures that are essentially identical to that of the parent circular form. Accordingly, they should have the same size and shape. And because they necessarily have the same charge, they should show virtually indistinguishable band positions in native gels, even though the linear and circular forms can be easily distinguished in denaturing gels where the secondary structure needed to effectively circularize the linear molecule has been destroyed. Similarly, under native conditions, small-angle X-ray scattering experiments, cryo-EM, and measurements of diffusion coefficients/hydrodynamic radii should show no difference between the circular and linearized molecules. The only caveat here, as well as for the measurements of 5′–3′ distance described earlier, is that the secondary structures of the molecules be equilibrated, since this is explicitly assumed in the theoretical arguments leading to all of these predictions [for a critical discussion of the equilibration/renaturation (and the lack thereof) of ssRNA, see Uhlenbeck (30)].
US National Science Foundation (grant number CHE07-14411 to W.M.G.); the Israel Science Foundation (grant number 695/06 to A.B.-S.); the US–Israel Bi-National Science Foundation (grant number 2006-401 to A.B.-S.); The Netherlands Organization for Scientific Research, Rubicon grant (to P.P.); and the University of California, Los Angeles, a Dissertation Year Fellowship (to A.M.Y.). Funding for open access charge: Research grant of A.B.-S. (grant number ISF 695/06).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online.
We thank Li Tai Fang and Charles M. Knobler for many helpful discussions.
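Consider a particular secondary structure of a given circular ssRNA molecule, containing N nucleotides and with a given base composition. Let $n(u,d)$ denote the number of $(u,d)$-loops (i.e. loops composed of $u$ unpaired nucleotides and $d$ duplexes) in this structure. Each $(u,d)$-loop can be cut through any of its $u + d$ covalent bonds, yielding $u + d$ open exterior loops of $u + 2d - 1$ links. The average effective contour length resulting from this cutting procedure is
$$D = \frac{\sum_{u,d} n(u,d)\,(u+d)(u+2d-1)}{\sum_{u,d} n(u,d)\,(u+d)} = \frac{\langle (u+d)(u+2d-1)\rangle}{\langle u+d\rangle} \qquad (\mathrm{A1})$$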
where the averages after the second equality are over all loops belonging to the particular structure. This follows from the fact that is the effective contour length of the exterior loop in a particular secondary structure, and is the statistical weight of -loops containing covalent bonds. , with denoting the fraction of -loops in this structure and denoting the total number of loops in this structure. The ‘marginal’ probability distribution is the fraction of loops containing unpaired nucleotides, regardless of the number of duplexes connected to these loops. Similarly, , etc. The sums over include all ( corresponds to a bulge) yet we also note that, in the case of a hairpin (), energetic considerations generally imply . The sums over include all .
For long random sequences a simplified expression for [see Equation (2)], involving only , can be derived based on two reasonable approximations. The first is to assume there are no correlations between the distributions of unpaired and paired nucleotides in loops, i.e. , from which it follows that . Small deviations from this approximation may occur because, for hairpins, we generally have , whereas for other loops we have . The second approximation serves to relate to and to . Here we assume that the distributions and of, respectively, [the (1−fα)N] unpaired nucleotides and () duplexes among the loops of structure , are random. These distributions (analogous to those of indistinguishable balls randomly distributed among boxes) are determined by maximizing the (entropy) functional (), subject to the normalization and conservation constraints. In this way we find , with a similar expression for . For concreteness and simplicity we set and for the minimum values of and , thus obtaining and . Similarly, , with the second equality following from the fact that, for all structures, . Equation (A1) now yields Equation (2) of the main text.