RNA structure is critical for gene regulation and function. In the past, transcriptomes have been largely parsed by primary sequences and expression levels, but it is now becoming feasible to annotate and compare transcriptomes based on RNA structure. In addition to computational prediction methods, the recent advent of experimental techniques to probe RNA structure by deep sequencing has enabled genome-wide measurements of RNA structure, and provided the first picture of the structural organization of a eukaryotic transcriptome, the “RNA structurome”. With additional advances in method refinement and interpretation, structural views of the transcriptome should help to identify and validate regulatory RNA motifs that are involved in diverse cellular processes, and thereby increase understanding of RNA function.
RNA is a unique informational molecule. In addition to carrying information in their linear sequences of nucleotides (primary structure), RNA molecules fold into intricate shapes. Pairing of local nucleotides can create secondary structures such as hairpins and stem loops, and interaction among distantly located sequences can further create tertiary structures. In every step of its life cycle, RNA structures influence the transcription, splicing, cellular localization, translation, and turnover of the RNA (Fig. 1). The topic of RNA structures in different cellular processes has been covered in several excellent reviews1–5. Although the structures of multiple RNAs have been studied in detail, structural information for most RNAs in the cell, such as mRNAs, is missing due to the low-throughput nature of RNA structure probing and the difficulty in probing long RNAs. Classic techniques require individually cloned RNA sequences, and only a few hundred bases can be interrogated per experiment. As most RNA structures are studied on a case-by-case basis, it is difficult to determine the full impact that an RNA’s structure has on cellular biology. To close this gap, genome-wide RNA structure determination has relied heavily on computational predictions to create structural models for hypothesis testing. Computational RNA prediction algorithms have advanced greatly in their ability to predict more accurate secondary structures from both primary sequences and sequence covariation. However, these predicted structures are typically confirmed by secondary structure probing, which still serves as the gold standard of RNA structure determination.
The advent of ultra high throughput sequencing technologies has enabled the sequencing of hundreds of millions of bases at a time, and greatly increased the speed and precision of genomic data. High throughput sequencing has been applied successfully in many applications, including genome discovery, transcriptome annotation, and global mapping of DNA-protein interactions6–8. Coupling RNA structure probing to high throughput sequencing yields genome-scale RNA structural information, providing insights into the secondary structures of thousands of transcripts in the cell. Here, we briefly summarize the importance of RNA structure in various cellular processes by highlighting a few recently discovered examples, review advances in computational structure predictions, focus on experimental approaches to large-scale RNA structure maps, and discuss the potential impact of this new kind of transcriptomic information.
RNA secondary and tertiary structures influence the function of almost all classes of RNAs, including mRNAs, non-coding RNAs such as riboswitches, ribozymes, long non-coding RNAs (lncRNA) and microRNAs (miRNA). RNA structures play roles in nearly every step of gene expression from transcription, mRNA processing, RNA localization, translation, to RNA decay (Table 1). RNA structures enable RNA to interact with itself, with other RNAs, with ligands and with RNA binding proteins. Many of these structures can exert their influence by helping to provide specific binding sites for RNA binding proteins (RBP) as well as restricting protein binding by altering accessibility. Identifying RBP binding sites and RBP consensus motifs is an area of intense study (Box 1).
RNA binding proteins (RBP) interact with RNAs to regulate diverse cellular processes. While many of these interactions are mediated by linear sequence motifs, RNA structural motifs as well as the structural context in which linear motifs are embedded also influence RBP binding. Different strategies have been developed to identify RNA consensus motifs. Transcripts associated with RBPs can be computationally searched for consensus nucleotide sequences that are selectively enriched in bound versus unbound transcripts using programs such as MEME, FIRE and REFINE141–143. Experimentally, SELEX and RNAcompete enable the determination of RNA consensus motifs by incubating an RBP with a complex pool of randomized short RNA sequences to selectively identify the sequences that have stronger binding affinities to the RBP142, 144. The development of new methods such as high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) and photoactivatable ribonucleoside enhanced crosslinking and immunoprecipitation (PAR-CLIP) allows the identification of both RBP-bound transcripts and the protein binding sites within them, greatly reducing the search space for consensus motif finding in RBP-bound targets24, 25. Importantly, incorporation of predicted RNA secondary structure can substantially increase the explanatory power of some linear RBP binding motifs; for instance, several motifs are shown to bind RBP only when the motif occurs in the context of a single-stranded, accessible region of mRNA145. Combined with an increased amount of available RNA structure data, it should become possible to predict consensus RNA structural motifs and assess the impact of RNA structures on RNA-protein interactions.
Multiple RNA structures can potentially be formed from a long linear sequence. RNA structures are frequently dynamic and RNAs can undergo different conformational changes based on their solvent conditions. RNAs can react to various inputs including differences in protein binding, changes in ligand and salt concentrations, and varying temperatures to result in gene expression changes, providing an additional layer of complexity to gene regulation. This role of RNA as a molecular sensor requires that RNA structures are highly specific, so that distinct RNA structures can respond to specific cellular stimuli, and that RNA structures are dynamic, so that the cellular response is fairly rapid. Below we elaborate on some examples that illustrate the specificity and dynamic character of RNA structures and how identifying such structures in a transcriptome-wide manner can enhance our understanding of RNA function.
One of the best examples that illustrates the specificity and the dynamics of RNA structures is the riboswitch. Riboswitches are RNA sensors that can detect changes in cellular stimuli in the absence of other cofactors such as proteins5, 9. As such, some of the first riboswitches were discovered based on changes in RNA structure induced by specific ligands5, 10; sequence alignment with established riboswitches allowed subsequent identification of riboswitch families11, 12.
A riboswitch typically consists of two domains, an aptamer domain that recognizes its specific ligand and an expression domain. Upon interacting with a ligand in its ligand binding domain, the riboswitch undergoes a conformational change that results in gene expression changes. Multiple classes of riboswitches exist that respond to a wide range of cellular stimuli including amino acids, nucleotides, metal ions, coenzymes and temperature to regulate processes such as transcription termination, changes in translation rate, splicing and mRNA decay13–16. Although first discovered in bacteria, riboswitches have been found in other organisms such as yeast, algae and plants, indicating the prevalence of this important regulatory mechanism in multiple kingdoms of life17, 18. However, only the thiamine pyrophosphate (TPP) riboswitch has been found outside eubacteria, and none has been found in mammals12.
Because the aptamer domains of riboswitches form Watson-Crick-type base-pairing contacts with their ligands, riboswitches are typically very specific for their ligands and can discriminate between their true ligands and other similarly structured molecules5. This specificity for its metabolite enables a riboswitch to serve as a cellular sensor. An example of this is the adenine riboswitch, in which a single nucleotide change from U to C in the ligand-binding site switches the specificity of the riboswitch from adenine to guanine19. This riboswitch is found in the 5′UTR of the ydhL mRNA and forms a secondary structure upon binding to adenine that prevents the formation of the terminator loop and transcription termination. High levels of adenine hence result in high protein levels of ydhL, which is a purine efflux pump, to pump purines out of the cells. Another example is the SAM riboswitch. Distinct classes of the SAM riboswitches can bind to S-adenosylmethionine (SAM), a coenzyme for methylation, or S-adenosylhomocysteine (SAH), a byproduct of the methylation reaction, even though SAM and SAH are highly similar in structure (Fig. 1A). This distinction is important to prevent the accumulation of toxic SAH and to recycle SAH to form SAM5. The diversity of SAM riboswitches also illustrates the possibility of multiple RNA structural solutions to the same biochemical challenge, raising the need to experimentally probe RNA structural dynamics rather than relying purely on sequence conservation.
The dynamics of RNA structure are also a recurring theme in mammalian RNAs. While the binding of protein factors to specific RNA elements has been extensively studied, it is only recently emerging that this binding can result in a corresponding change in RNA structure, which affects gene expression. The VEGFA mRNA contains a 125-base hypoxia stability region in its 3′UTR, and the structure of this region changes depending on whether the cell is exposed to normoxic or hypoxic conditions in the presence of interferon gamma20. During normoxia, the presence of the GAIT complex causes the VEGFA mRNA to form a structure that is not permissive to translation. During hypoxia, however, the binding of HNRNPL causes the RNA to switch to a different conformation that permits protein translation.
MicroRNAs (miRNAs) are short (~23-nt) RNAs that modulate gene expression in normal development and disease pathogenesis. Recently, RNA conformations within a transcript have also been found to be one of the determinants of whether a transcript is targeted by specific miRNAs. The interaction between miRNAs and 3′ UTRs of their targets can lead to mRNA destabilization and/or translation inhibition. Accessibility of miRNA target sites can influence miRNA binding, as target sites that are buried in secondary structures may be sterically hindered from interacting with miRNAs21. Interestingly, accessibility of miRNA target sites can change in different biological states, indicating an additional layer of gene regulation22. One prime example is the regulation of levels of p27, a cyclin dependent kinase inhibitor, during different stages of the cell cycle. p27 protein level is low in dividing cells but high in non-dividing, quiescent cells. Upon growth factor stimulation, Pumilio-1 protein is activated, binds to the p27 mRNA 3′ UTR, and induces an RNA structural change. This structural change exposes the microRNA target sites in the 3′ UTR of p27, allowing miR-221 and miR-222 to interact with the p27 3′ UTR, causing translation repression and a reduction in p27 protein levels (Fig. 1B).
There is an increasing number of genome-wide datasets on RNA binding proteins and their targets, as well as on where these proteins bind to their mRNA targets23–25. Probing RNA structures in a genome-wide manner both in vitro and in vivo would enable us to study the structural context that determines protein binding to RNAs, and to identify regions of RNA structural change that occur in the presence and absence of protein binding. As many such structural changes result in meaningful functional outputs, such as changes in translation or decay, this would enrich our mechanistic understanding of how RNA structures impact cellular function.
Given the experimental difficulties in measuring RNA structure, algorithms for predicting RNA structure from primary sequence have been developed and applied in many settings26–31. When accurate, these approaches have clear advantages, as they do not require experimentation, and can also be used to predict the structure of any arbitrary transcript, including hypothetical transcripts with designed mutations. Indeed, approaches based on computational predictions have led to many biological discoveries and insights. For example, for specific classes of ncRNAs whose members share structural properties essential for their function, computational methods utilizing secondary structure predictions were successfully used to annotate new members of that ncRNA class. Examples include methods for predicting tRNAs32, 33, snoRNAs32 and microRNAs34. By combining RNA structure predictions with comparative genomic analysis, the more general task of identifying novel ncRNAs from a genome sequence has also been addressed in many organisms35–37. Finally, several methods have been developed for identifying structural motifs that are common to multiple RNAs, and that may have a role in the subcellular localization, stability, or the function of the RNA in which they are embedded30, 38–41 (Fig. 2a,b).
Several different approaches exist for predicting RNA secondary structure. Methods based on comparative sequence analysis rely on the fact that many of the known functional RNA structures are conserved in evolution. Examples include tRNAs, rRNAs, and group I and group II introns42, 43. Covariation methods determine secondary structure by examining conservation patterns of basepairs among homologous or paralogous genes. Such covariation methods search for two distinct genomic sequences in which evolutionary sequence changes in one sequence are accompanied by compensatory sequence changes in the other sequence that preserve RNA structure42. For example, the pairing of G-C nucleotides between two distinct genomic sequences can be maintained at the structure level in another species if the G-C nucleotides have changed to A-U nucleotides (Fig. 2c). The structure can be determined directly from the pattern of conserved pairings when enough homologous sequences are available, and several methods exist for this27, 44–47. In other cases, a combined thermodynamic-covariation method can be used48.
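The covariation idea described above can be illustrated with a minimal sketch. It is not any of the cited methods: it simply checks whether two columns of a toy alignment can pair in every sequence and counts how many distinct pair types occur, since several pair types at a conserved pairing position suggest compensatory changes. The function name `covariation_support`, the treatment of G-U wobble pairs as pairable, and the example alignment are all assumptions made for illustration.

```python
# Pairs treated as "able to pair" in this sketch (Watson-Crick plus G-U wobble).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def covariation_support(alignment, i, j):
    """Return (all_can_pair, n_pair_types) for alignment columns i and j (0-based).

    all_can_pair is True when every sequence has a pairable combination at (i, j);
    n_pair_types counts the distinct nucleotide combinations seen at (i, j).
    """
    observed = {(seq[i], seq[j]) for seq in alignment}
    all_can_pair = all(pair in PAIRS for pair in observed)
    return all_can_pair, len(observed)

# Hypothetical alignment of three homologous hairpins (made-up sequences).
alignment = [
    "GGGAAACCC",
    "GAGAAACUC",
    "GCGAAACGC",
]

support = covariation_support(alignment, 1, 7)   # G-C, A-U and C-G at the same position
conserved = covariation_support(alignment, 0, 8)  # G-C in every sequence
loop = covariation_support(alignment, 3, 4)       # A and A cannot pair
```

In this toy case columns 1 and 7 pair in every sequence with three different pair types, which is the pattern a covariation method would score as evidence for a conserved helix; real methods additionally weigh phylogeny and alignment quality.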
When only a single sequence is available, an accurate and popular method is thermodynamic computation of the minimal free energy structure. This method uses efficient dynamic programming algorithms in conjunction with experimentally derived energy parameters to scan the entire landscape of possible secondary structure configurations and identify the most thermodynamically stable structure26, 49, 50. For sequences that are shorter than 700 bases, ~70% of the known base pairs are correctly predicted by these methods. For longer sequences, however, the accuracy drops to ~20–60% when the predicted structures are compared to high resolution crystal structures and structural predictions obtained using comparative analysis51, 52. As an alternative to free energy minimization methods, algorithms based on probabilistic modeling using stochastic context-free grammars (SCFGs) were also developed, but since their accuracy is lower, thus far they have not replaced free energy minimization methods28. A more recent strategy combines thermodynamic modeling with machine learning methods and chooses the set of nucleotide pairings with the maximal sum of pairing probabilities53, 54. An interesting application of thermodynamic modeling techniques is the evaluation of potential RNA structural changes caused by noncoding single nucleotide polymorphisms associated with human diseases. Laederach and colleagues identified multiple disease-associated mutations in UTRs that alter the mRNA structural ensemble of the associated gene, providing new hypotheses for causes of human disease and variation55.
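To give a feel for how dynamic programming scans the space of nested structures, the sketch below implements base-pair maximization in the style of the Nussinov algorithm. This is a deliberately simplified stand-in, not the energy-based method described above: it counts base pairs instead of summing experimentally derived free energy terms, and the `min_loop` parameter (minimum number of unpaired bases enclosed by a hairpin) is an assumption chosen for illustration.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq.

    Simplified stand-in for free energy minimization: each allowed pair
    scores 1, and a pair (k, j) requires j - k > min_loop.
    """
    pairs = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] holds the optimum for the subsequence seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

hairpin_pairs = nussinov_max_pairs("GGGAAACCC")
```

The triple loop makes the cubic running time of nested-structure prediction explicit; energy-based algorithms keep the same recursion shape but replace the `+ 1` with stacking, loop, and bulge energy terms.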
Another successful approach has been to incorporate experimentally derived structural information into computational predictions. This approach has been in use since the first prediction algorithms became available and has been further developed throughout the years29, 56–59. When the experiment can only derive binary information for each nucleotide, namely whether the nucleotide was paired or unpaired, the dynamic programming algorithm can be modified such that large positive free energy terms are added to nucleotides that are known to be unpaired, thus restricting the algorithm from marking them as paired57. More recently, methods that use quantitative, nucleotide resolution experimental data (discussed below) to direct the prediction of a folding algorithm have been introduced, by integrating an additional per nucleotide pseudo-free energy term into the dynamic programming algorithm59. This method was shown to significantly increase the accuracy of structure prediction.
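As a hedged illustration of a per-nucleotide pseudo-free energy term, the sketch below uses the logarithmic form commonly associated with SHAPE-directed folding, ΔG(i) = m·ln(reactivity(i) + 1) + b. The functional form and the default values m = 2.6 and b = -0.8 kcal/mol are stated here as assumptions rather than taken from the text above; the function name and the clamping of negative reactivities to zero are likewise illustrative choices.

```python
import math

def pseudo_free_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo-free energy (kcal/mol, assumed units) for one nucleotide.

    Assumed form: slope * ln(reactivity + 1) + intercept.
    Negative (noise-level) reactivities are clamped to zero.
    """
    reactivity = max(reactivity, 0.0)
    return slope * math.log(reactivity + 1.0) + intercept

# Hypothetical normalized reactivities for five nucleotides.
reactivities = [0.0, 0.1, 0.5, 1.0, 2.0]
penalties = [pseudo_free_energy(r) for r in reactivities]
```

Under these assumed parameters, unreactive nucleotides receive a small bonus for pairing (a negative term) while highly reactive nucleotides receive a growing penalty, so the experimental data bias the folding algorithm without imposing hard constraints.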
Despite their many successes, current prediction algorithms have several limitations. First, RNA molecules in solution may adopt secondary structures that are only partially determined by thermodynamics, as RNA molecules can undergo conformational changes upon interaction with other RNAs and RNA-binding proteins. These context-dependent RNA-protein interactions are extremely complex to model and are thus excluded from all prediction algorithms. Second, although our knowledge of thermodynamic rules and parameters has greatly improved, it is far from complete29, 57, 60, 61. Finally, most folding algorithms use approximations in order to efficiently scan the vast landscape of possible secondary structures. Important limitations are the difficulty of predicting pseudoknots (RNA topologies that contain non-nested nucleotide pairings) and of taking into account long-range and tertiary structure interactions. Pseudoknots have been observed in a number of functional RNA sequences, such as ribosomal RNAs (rRNAs), transfer RNAs (tRNAs) and viral RNA genomes62, where they have been shown to be involved in unique mechanisms of viral translation initiation and elongation63. Thus, ignoring pseudoknots results in inaccurate structure predictions62, 64. In contrast to the prediction of nested structures (free of pseudoknots), which can be efficiently solved using dynamic programming, predicting structures that contain pseudoknots is very challenging computationally. For a large class of pseudoknot models, pseudoknot prediction has been proven to be “NP-complete”, a class of computational problems with no known fast solutions65. As a result, several methods have been developed that focus on specific types of pseudoknots66–68, or employ heuristics69–73, to bring the running time down. Nonetheless, the cost of these restricted pseudoknot prediction methods still grows steeply with the length of the RNA, as a high-order polynomial [on the order of O(n^4)-O(n^6), where n is the length of the RNA sequence].
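The definition of a pseudoknot as a set of non-nested pairings can be made concrete with a short check: two base pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below is only an illustration of that definition, with a hypothetical function name and a structure represented as a list of 0-based index pairs.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair so the 5' index comes first, then sort by 5' index.
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            if i < k < j < l:
                return True
    return False

nested = [(0, 8), (1, 7), (2, 6)]    # a simple hairpin stem
crossing = [(0, 5), (1, 4), (3, 9)]  # (3, 9) crosses (0, 5) and (1, 4)
```

Nested structures can be decomposed into independent subproblems, which is what makes dynamic programming efficient; a single crossing pair breaks that decomposition.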
Thus, although the extensive research and development of RNA structure prediction tools has led to many successes and discoveries, the applicability of existing tools is still limited and further experimental data is needed to bridge the knowledge gap. However, the accumulation of additional experimental data should lead to better optimization of existing algorithms and to the development of new strategies, some of which may combine experimental and computational approaches.
RNA footprinting is a method that probes RNA in solution using a variety of chemical and enzymatic probes74. With in vitro footprinting, an RNA of interest is typically transcribed in vitro and folded in solution before being subjected to a battery of different structural probes that determine which of the bases are single stranded, double stranded, or solvent exposed74, 75. Chemicals including dimethyl sulfate (DMS), 1-cyclohexyl-(2-morpholinoethyl)carbodiimide metho-p-toluene sulfonate (CMCT), kethoxal, lead (Pb2+) and N-methylisatoic anhydride (NMIA), and nucleases including RNase I, T1, A and S1 nuclease, interact with single stranded or flexible bases to modify or cleave them76–80; enzymes such as RNase V1 recognize and cleave at double stranded bases81; hydroxyl radicals cleave at RNA bases that are solvent exposed82, 83. The combinatorial usage of the above probes provides structural information on most bases in the RNA. Upon cleavage or modification, the reaction sites can be detected by autoradiography, or alternatively by reverse transcription, followed by gel or capillary electrophoresis (Fig. 3). The location of the cleavage is determined from the migration pattern of the bands, and the intensity of the bands can be quantified using image processing tools, such as the program semiautomated footprinting analysis (SAFA)84.
RNA footprinting can also be performed in vivo85, 86. Because some RNAs are able to fold into alternative conformations in vitro that do not reflect their in vivo biological conformations, structure probing in vivo may provide more accurate information on biologically relevant RNA structures87. RNA footprinting can be carried out inside cells using chemicals that can penetrate the cell membrane, such as lead and DMS, or with high energy X-rays82, 86, 88. Lead probing has been successfully applied to in vivo structure probing in bacteria, while DMS has been applied to both prokaryotic and eukaryotic cells86, 88. However, in vivo RNA footprinting may not be able to interrogate all regions of an RNA of interest due to steric hindrance from protein interactions. The dynamic cellular environment also presents RNA in heterogeneous states: RNAs in different stages of their life cycle, during transcription, translation and decay, are all present. Averaging the structural signal from heterogeneous states may therefore prove to be inaccurate. As such, structural probing in vitro and in vivo provide complementary information about RNA structures. In all footprinting experiments, it is important to titrate the amount of structural probe to achieve single-hit kinetics, such that on average the RNA of interest is cleaved no more than once per molecule. This ensures that the footprinting is performed on the original folded RNA, instead of on RNA that has refolded incorrectly after it has been cleaved.
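The rationale for titrating the probe can be sketched with a simple model. Assuming, purely for illustration, that cleavage or modification events on each molecule are independent and Poisson distributed with a given mean number of hits, the fractions of molecules with zero, exactly one, and more than one hit follow directly. The Poisson assumption and the function name are not from the text above.

```python
import math

def hit_fractions(mean_hits):
    """Fractions of molecules with 0, exactly 1, and >1 hits (Poisson model)."""
    p_zero = math.exp(-mean_hits)
    p_one = mean_hits * p_zero
    p_multi = 1.0 - p_zero - p_one
    return p_zero, p_one, p_multi

# Compare a low probe dose with a mean of one hit per molecule.
low_dose = hit_fractions(0.1)
one_per_molecule = hit_fractions(1.0)
```

Under this model a mean of one hit per molecule still leaves roughly a quarter of the molecules with multiple hits, whereas a mean of 0.1 keeps multi-hit molecules below one percent at the cost of leaving most molecules unmodified; this trade-off is what the titration balances.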
Application of capillary electrophoresis to RNA structure probing is an important step in increasing the throughput of RNA structure data. Although RNA probing in solution can be readily implemented for short RNAs, probing of long RNAs can be challenging. Gel electrophoresis typically resolves about a hundred bases of RNA at a time, and hence probing an RNA several kilobases long would require running tens to hundreds of gels. Capillary electrophoresis allows the resolution of 300–650 bases from a structure probing experiment, and multiple lanes can be run at the same time to increase the throughput of RNA structure probing89, 90. The readout of the probing experiment is typically obtained by reverse transcription from a 5′ fluorescently labeled DNA primer that anneals specifically to the RNA of interest. If the RNA is several kilobases long, multiple primers are designed to anneal along the length of the transcript. Modification or cleavage of the RNA template results in premature stops in the primer extension reaction, leading to cDNA products of different lengths, which are resolved by capillary electrophoresis. Software tools such as CAFA and ShapeFinder can automate the data acquisition from capillary electrophoresis and further improve speed and accuracy89, 90 (Fig. 3).
The SHAPE method uses the chemical NMIA and its derivatives to interrogate flexible regions in RNA secondary structure80. The 2′ OHs of flexible bases are able to orient themselves more readily for attack by the electrophile NMIA, resulting in the formation of 2′-O-adducts. These 2′-O-adducts can be detected by reverse transcription and capillary electrophoresis. As every ribonucleotide contains a 2′ OH, SHAPE has the advantage of being able to probe most bases in an RNA. With the coupling of SHAPE to capillary sequencing, SHAPE has been applied to interrogate the secondary structures of long RNAs, such as the 16S rRNA and the RNA genome of the human immunodeficiency virus (HIV)59, 91, 92.
The construction of the secondary structure of the HIV genome using SHAPE was a landmark that demonstrates the substantial value of comprehensive RNA structure analysis91. The HIV genome is a 9kb long single stranded RNA that encodes nine open reading frames that are translated into fifteen proteins important for HIV infection and replication. Initial probing of the first 900 bases of the HIV genome across four different biological states showed highly similar secondary structures in virio and ex virio92. Regulatory regions within the 900 bases are found to be more structured than protein coding regions, and multiple regions within the RNA are found to interact with the nucleocapsid proteins. Structure probing of the entire 9kb HIV genome ex virio by SHAPE further found numerous regions within the genome that have functional roles in HIV replication91. These structured RNA domains provide insights into Gag-Pol frame-shifting, hyper-variable domains, and translocation of the Env protein. Interestingly, the nucleotides that encode for loops between independently folded protein domains are more structured than their surrounding bases, and are able to fold into secondary structures that retard the mobility of ribosomes for co-translational protein folding of modular domains91.
Coupling RNA footprinting, such as SHAPE, to capillary sequencing has opened the door to structure probing of large RNAs, and it is likely that more RNA genomes, such as those of poliovirus and HCV, will be structurally probed to understand the role of RNA structures in viral replication. Furthermore, RNA structure probing is likely to extend beyond the probing of a single viral genome to families of viral genomes, to discover conserved or rapidly evolving structural elements that are likely to be functionally important in viral biology or pathogenicity. To facilitate this, the throughput of RNA structure probing can be greatly enhanced by coupling RNA footprinting to high throughput sequencing, which provides orders of magnitude more sequencing information than capillary sequencing.
The application of next-generation sequencing allowed the next major advance in genome-wide measurements of RNA structure, since millions of sequence reads can be obtained in a single experiment (Fig. 4). Cleavages or modifications at double or single stranded bases from structure probing can be captured and converted into cDNA libraries that are sequencing compatible. These sequencing reads are mapped back to the genome or the transcriptome to identify the transcript and the locations along the transcript at which the cleavages occurred. The intensity of the cleavage at a base can also be calculated by summing the reads that are mapped to the base. This strategy allows the simultaneous identification of double or single stranded/flexible bases in thousands of RNAs in one experiment. In a strategy termed Parallel Analysis of RNA Structure (PARS), deep sequencing reads of double- or single-stranded regions of RNAs generated by RNase V1 and S1 nuclease, respectively, are compared21. An alternative strategy, named Fragmentation sequencing (Frag-seq), quantifies deep sequencing reads generated specifically by nuclease P1, a single-strand specific nuclease93.
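The counting step described above, and one simple way to compare double-strand and single-strand cleavage signals, can be sketched as follows. Mapped read start positions are tallied per base, and the two profiles are compared as a log2 ratio with a pseudocount. The log-ratio comparison, the pseudocount, and all function names are assumptions for illustration; this is not the published scoring pipeline, which also involves normalization between libraries.

```python
import math
from collections import Counter

def cleavage_counts(read_starts, length):
    """Number of mapped reads starting at each base of a transcript (0-based)."""
    counts = Counter(read_starts)
    return [counts.get(pos, 0) for pos in range(length)]

def log_ratio_scores(v1_counts, s1_counts, pseudocount=1.0):
    """Per-base log2 ratio of double-strand (V1) to single-strand (S1) signal.

    Positive values suggest paired bases, negative values unpaired bases.
    """
    return [
        math.log2((v + pseudocount) / (s + pseudocount))
        for v, s in zip(v1_counts, s1_counts)
    ]

# Hypothetical read start positions on a 4-base toy transcript.
v1 = cleavage_counts([0, 0, 0, 1], length=4)
s1 = cleavage_counts([2, 2, 2, 3], length=4)
scores = log_ratio_scores(v1, s1)
```

The pseudocount keeps the ratio defined at bases with no reads in one library; in practice, bases with too few reads in both libraries would usually be treated as uninformative instead of being scored.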
Using PARS, Kertesz et al. measured the secondary structure of the yeast transcriptome, generating structural information on ~4.2 million bases in over 3000 yeast transcripts21. Mapping PARS scores to known structures of regulatory motifs, such as Ash1 localization elements (required to properly localize Ash1 mRNAs to the yeast bud tip) and the internal ribosomal entry site of URE2 mRNA, indicates that PARS is able to capture the structural information in these elements, demonstrating the utility of this high throughput data. The large amount of PARS data provides insights into the global structural organization of mRNAs, including the presence of more secondary structure in coding regions as compared to untranslated regions, a three-nucleotide periodicity of secondary structure along the coding regions, and an anti-correlation between mRNA translation efficiency and structure around the translation start site (Fig. 5). Using Frag-seq, Underwood et al. correctly reconstructed the secondary structure of snoRNAs in mouse cells93. Both Frag-seq and PARS data can be integrated into structure prediction programs for more accurate RNA secondary structure prediction. PARS data were used to constrain a thermodynamic RNA structure prediction algorithm as binary inputs (paired vs. unpaired), while a custom algorithm was developed to accommodate Frag-seq data. In essence, the nature, number, and location of structured regions in the transcriptome can be rapidly discovered, leading to many hypotheses and potential insights into gene regulation.
Comparison of PARS and Frag-seq reveals the complementary nature of the information that they provide. First, because Frag-seq isolates RNAs between 20 and 100 bases after P1 cleavage without an additional fragmentation step, many sequence reads came from small RNAs, such as snoRNAs, while larger RNAs may be under-represented. Second, structured regions appear as “blanks” in Frag-seq data, and other information is thus necessary to ensure that these regions are not missed due to mapping or cloning difficulties. Third, while PARS compares the cleavage sites of a single-strand specific enzyme with those of a double-strand specific enzyme, Frag-seq uses as background the endogenous 5′ OH and 5′ P ends within the transcriptome. This latter control can also identify regions that vary in their ability to be cloned and amplified during library production. Thus, by combining features from PARS and Frag-seq, future experiments can exploit the strengths of each to improve the accuracy of genome-scale measurements of RNA structure.
Recently, SHAPE has also been coupled to deep sequencing94. Lucks et al. in vitro transcribed seven short RNAs, each appended with a unique sequence tag (a barcode). After the RNAs are reacted with the SHAPE chemical 1M7 to acylate flexible bases, the reacted bases are indirectly detected by their ability to terminate the reverse transcription reaction and are read out by sequencing the cDNAs. Because of the barcode, multiple sequences, even those with extensive sequence similarity, can be probed simultaneously. SHAPE-seq data correlate well with data from SHAPE followed by capillary sequencing for RNase P and the pT181 attenuator, showing that sequencing largely captures the same structural information as capillary sequencing. This approach is likely useful for studying multiple mutants of one RNA or multiple members of a closely related RNA family. Comparison of SHAPE-seq with PARS or Frag-seq illustrates several trade-offs in experimental design. The use of individual barcodes to assign identity to RNAs enables studies of highly related RNAs, but limits the ability to scale the same procedure up genome-wide, particularly when RNA sequences are not known a priori. Also, the choice to measure the cDNA product in SHAPE-seq, rather than directly clone the RNA fragments as in PARS and Frag-seq, means that the processivity of reverse transcription becomes a dominant factor in SHAPE-seq data processing and the modeling of RNA secondary structure. SHAPE-seq signal progressively decays from 3′ to 5′ of the RNA template, the direction of reverse transcription, and a detailed mathematical model has been developed to correct for this signal decay95. Such models and the use of many more internal primers may allow full length mRNAs to be assessed by SHAPE-seq.
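The reason the signal decays along the template can be illustrated with a much simpler calculation than the published model, which is not reproduced here. Because reverse transcription runs from the 3′ end towards the 5′ end, only cDNAs that did not stop earlier can report on positions closer to the 5′ end. A basic per-position estimate therefore divides the stops at a position by the number of cDNAs that reached it. The function name, the indexing convention (index 0 is the 5′-most position), and the toy counts are assumptions.

```python
def stop_rates(stops, full_length):
    """Per-position stop rate, corrected for upstream drop-off.

    stops[k] is the number of cDNAs terminating at template position k,
    with index 0 the 5'-most position; full_length is the number of cDNAs
    that extended to the 5' end. A cDNA reached position k if it stopped
    at k, stopped closer to the 5' end, or ran to full length.
    """
    rates = [0.0] * len(stops)
    reached = full_length
    for k in range(len(stops)):  # accumulate from the 5' end towards the 3' end
        reached += stops[k]
        rates[k] = stops[k] / reached if reached else 0.0
    return rates

# Toy counts: equal raw stops at three positions, 70 full-length cDNAs.
rates = stop_rates([10, 10, 10], full_length=70)
```

In this toy example the raw counts are identical, yet the estimated stop rate is higher at the 5′-most position (10 of 80 cDNAs that reached it) than at the 3′-most position (10 of 100), which is the kind of bias a decay correction has to account for.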
The genome-scale RNA structure maps have three important advantages over prior methods. The first advantage is the amount of data measured by deep sequencing, a technology that is itself developing rapidly. While RNA footprinting with capillary sequencing is still very much directed at interrogating a single RNA of interest, PARS and Frag-seq have the power to probe the structures of entire transcriptomes comprising tens of thousands of transcripts. Second, the degree of parallel multiplexing is much enhanced in the new methods. Capillary sequencing is typically performed with one purified RNA product and one primer per well. Thus, to study multiple genes, an investigator needs to clone each of these genes as well as prepare unique primers that span the length of the transcripts. In contrast, due to the massively parallel nature of deep sequencing technology, thousands of distinct RNAs, each multiple kilobases long, can be probed easily with high-throughput sequencing, as long as the RNAs are fragmented to a size that is captured by the library preparation. This genome-wide approach allows biologists to compare the structural profile of one transcript with that of another in the transcriptome easily, enabling them to classify transcripts according to specific structural features.
Finally, PARS and Frag-seq can also perform de novo transcript discovery and probe the structures of RNAs that either were not previously known to be present or have undergone post-transcriptional modifications such as alternative splicing or RNA editing. In contrast, for capillary sequencing (or SHAPE-seq as it is currently practiced), the nucleotide sequence, as well as how the RNA is spliced, needs to be known in order to design primers along the length of the RNA and identify the bases that reacted with structural probes. This process is not only tedious but also restricts capillary sequencing to structure probing of transcripts that are well annotated in the transcriptome.
Despite these potential advantages, care and thoughtful controls are necessary to design and interpret genome-scale RNA structure maps, as has been done with RNA footprinting by capillary sequencing96. Key considerations include replicates to examine reproducibility, titration of structural probes to maintain single-hit kinetics, and controls to assess various biases that may arise from library preparation, deep sequencing, or mapping97. Doping positive control RNAs with well-known structures into the genome-scale reactions is a useful way to assess the quality of the structural information generated by deep sequencing.
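One simple way to examine reproducibility between replicates is to correlate their per-base signals. The sketch below assumes two replicate profiles covering the same bases and uses the Pearson correlation as the summary statistic; the function name `replicate_correlation` is hypothetical, and the choice of statistic is an assumption rather than a method prescribed by the studies discussed here.

```python
import math

def replicate_correlation(rep_a, rep_b):
    """Pearson correlation between two per-base signal profiles."""
    n = len(rep_a)
    if n != len(rep_b) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_a = sum(rep_a) / n
    mean_b = sum(rep_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rep_a, rep_b))
    var_a = sum((a - mean_a) ** 2 for a in rep_a)
    var_b = sum((b - mean_b) ** 2 for b in rep_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical read counts over the same five bases in two replicates.
print(replicate_correlation([12, 3, 40, 7, 0], [10, 5, 36, 9, 1]))
```

A high correlation indicates only that the measurement is repeatable; biases shared by both replicates, such as those from library preparation or mapping, still require the separate controls described above.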
Much remains to be done and learned from genome-wide maps of RNA structure. First, it is likely that multiple technical advances will improve the quality of the maps. With classic RNA footprinting, multiple enzymes and chemical reagents are used to generate a consensus picture of RNA structure, and it is likely that multiple reagents, including DMS, lead and others, will be adapted to deep sequencing readouts. The use of third-generation, single-molecule sequencing platforms that do not require amplification and are capable of reading hundreds to thousands of nucleotides may also expand the range of questions that can be addressed. For instance, the long-range structural impact of alternative splicing of exons located hundreds or thousands of bases apart could be evaluated more simply.
Second, in vivo and dynamic RNA structure maps will yield critical understanding of how RNA structures may change and help regulate different biological states. Currently, both PARS and Frag-seq have probed the structures of RNAs that are isolated from cells and renatured in vitro, but these techniques can be readily applied to native RNA isolated without denaturation. Several chemical probes that penetrate cellular membranes, such as lead, DMS, NMIA and hydroxyl radicals, have been used successfully to probe RNA structures in vivo79, 82, 88, 92. RNA footprinting can also be performed under diverse conditions, such as altered temperature or the presence of specific proteins or small molecule ligands, to probe the impact of these perturbations on RNA structure10, 22, 98, 99.
Third, new computational strategies are emerging to better integrate experimental and computational RNA structures and delineate their impact on RNA function58, 59. The challenges are to accurately predict the structure of an RNA given its profile in the genomic RNA structure map, and further to predict the impact of changes in RNA structure (due to single nucleotide polymorphisms, changes in biological state, or drugs) on biological outcome. It is likely that cross comparison of genomic RNA structure maps with high-resolution maps of RNA-protein interactions will be one immediate avenue whereby such integrative analyses can yield useful biological insights24, 25.
We gratefully acknowledge the support of NIH (R01-HG004361), Agency of Science, Technology and Research of Singapore (Y.W.), and A.P. Giannini Foundation (R.C.S.). E.S. is the incumbent of the Soretta and Henry Shapiro career development chair. H.Y.C. is an Early Career Scientist of the Howard Hughes Medical Institute.