In recent years, views of eukaryotic gene expression have been transformed by the finding that enormous diversity can be generated at the RNA level. Advances in technologies for characterizing RNA populations are revealing increasingly complete descriptions of RNA regulation and complexity—for example through alternative splicing, alternative polyadenylation, and RNA editing. New biochemical strategies to map protein-RNA interactions in vivo are yielding transcriptome-wide insights into mechanisms of RNA processing. These advances, combined with bioinformatics and genetic validation, are leading to the generation of functional RNA maps that reveal rules underlying RNA regulation and networks of biologically coherent transcripts. Together, these are providing new insights into molecular cell biology and disease.
Gene expression is finely regulated to ensure that the correct complement of RNA and proteins is present in the correct cell at the correct time. Owing to its diversity—in sequence and structure—RNA plays critical roles in cell biology, and is regulated by numerous proteins that modulate its content and spatial-temporal expression. Methodological advances, including bioinformatic, microarray-based, biochemical and deep sequencing studies, are producing new insights into the role that regulation of RNA complexity—the sum of the unique isoforms of RNA in a cell, including mRNA variants, non-coding RNAs and microRNAs (miRNAs)—plays in generating organismal complexity from a relatively small number of genes. Here we review this progress, focusing on mRNAs and the ways in which the technological advances are beginning to revolutionize our ability to understand the mechanisms and consequences of mRNA diversification.
The recognition of RNA regulation as a central point in gene expression and the generation of phenotypic complexity1 began with new methodologies and biological insights developed in the 1970s–1980s. Nascent transcripts were found to be generated as long heterogeneous nuclear RNAs (hnRNAs)2,3 (now termed pre-mRNA) that serve as precursors for smaller 5′ capped and 3′ polyadenylated mRNAs that are then exported to the cytoplasm. Insights into the mechanism by which pre-mRNA is processed to mature mRNA resulted from methodologic advances - including S1 nuclease mapping4 and electron microscopy to visualize R-loops of adenovirus mRNA:DNA hybrids5,6 - that enabled nucleotide-level examination of the precursor-product relationship of adenoviral transcripts. These efforts revealed that adenoviral mRNA has “an amazing sequence arrangement”6 such that processing of pre-mRNA to mature mRNA involves the intra-molecular joining (splicing) of expressed sequences (exons) that are separated by non-coding intervening sequences (introns)7 in the primary transcript (Figure 1). This was quickly recognized as a general feature of eukaryotic RNA processing8,9.
The discovery of splicing led to the realization that RNA has the potential to be more complex than DNA7,10. This potential was demonstrated by the finding, first in adenovirus11 and subsequently in eukaryotic cells during cell differentiation12 and in different tissues13, that alternative mRNA products could be generated from a single pre-mRNA precursor in a regulated manner. In this way regulation of alternative splicing and polyadenylation enables a single mammalian gene to encode multiple mRNAs that possess distinct coding and regulatory sequences.
A more recent epoch in understanding RNA complexity was ushered in with the ability to sequence complete genomes, and the concomitant realization that humans and worms have roughly the same number of protein coding genes (and, more recently, that human and chimpanzee genomic coding regions are 99.7% identical)14. These observations, together with the development of the RNA World hypothesis15, 16, led to a new concept that is explored in this review. This concept is that biological complexity—the variation in cell type and function—has RNA complexity at its core. In this view, it is the intricate unfolding of the genetic information in DNA into diverse RNA species - mediated by RNA-protein interactions - that leads to biological variation not evident from analysis of DNA sequence alone.
The known roles of RNA in the cell have expanded from it being a machine and template for protein synthesis to a regulatory hub for post-transcriptional control with emerging, and still incompletely understood, roles as a trans-acting factor that is capable of regulating expression of genetic information. For example, miRNAs17, piRNAs18 and long non-coding RNAs19,20 act to direct different RNA binding proteins (RNABPs) to their regulatory targets in order to suppress translation21, provide protection from transposable elements18, and mediate epigenetic changes1,22,23, respectively. Adding to its versatility, RNA transcripts are diversified from the point of transcription onwards through the action of a plethora of mechanisms, including alternative transcription initiation24–26, alternative splicing27–29, alternative polyadenylation30, RNA editing31, and post-transcriptional modification (pseudouridylation32, methylation33, and non-canonical polyadenylation and RNA terminal polyuridylation34,35). Once generated, mature RNA isoforms are subject to many levels of regulation that include the regulation of translation by miRNAs21 and regulatory factors36, the use of alternative translational start sites37, RNA localization38, and mRNA stability and turnover39,40.
RNA regulation is achieved through the concerted action of multiple RNABPs41 that bind to ‘core’ and ‘auxiliary’ elements, which are required for and modulate pre-mRNA processing events, respectively (Box 1). Core splicing elements demarcate exons and the sequences required for their splicing, and auxiliary splicing elements, which are located in introns and/or exons, bind factors that enhance or inhibit splicing. Similarly, mRNA 3′ end maturation also depends on the presence of core and auxiliary elements that define the site of transcript cleavage and polyadenylation42,43. The identification of alternative polyadenylation sites in the majority of human genes, together with evidence for tissue-specific biases in alternative polyadenylation8,44–46, suggests that regulation of alternative polyadenylation through auxiliary control might be a common mechanism to diversify the transcriptome.
Core elements necessary for pre-mRNA splicing include the 5′ and 3′ splice sites (SS), a branch point sequence (BP) upstream of the 3′SS, and a polypyrimidine-rich tract (PPT) between the BP and the 3′ SS. All of these elements are bound by components of the spliceosome, which is a dynamic macromolecular complex that consists of snRNAs and ~170 proteins29. Auxiliary sequences are variable in number and location - they can be located in exons and in the flanking intronic sequences - and are bound by factors that generally function to either enhance or inhibit basal splicing activity. The combinatorial actions of both core and auxiliary splicing factors participate in the regulation of alternative splicing. For example, the SR proteins comprise a family of auxiliary RNABPs that bind to splicing enhancer elements to facilitate exon identification and promote splicing (although like most RNABPs they are also able to serve other functions in the cell). In contrast, the binding of auxiliary hnRNP proteins to splicing silencer elements has a negative effect on exon inclusion; in many cases they antagonize the “pro-splicing” activity of SR proteins. Interestingly, the levels of some core snRNPs vary between tissues128, and such variations might contribute to splicing regulation41. Core elements necessary for maturation of the 3′ end of an mRNA include a poly(A) signal (an adenylate-rich hexameric sequence, most often AAUAAA) and a U/GU-rich sequence, which are positioned upstream and downstream of the poly(A) site respectively. These elements direct the endonucleolytic cleavage and polyadenylation of the transcript. Although a number of auxiliary elements that affect the use of poly(A) sites have been identified43, the extent to which these elements regulate alternative poly(A) site use remains unclear.
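The core 3′ end elements described in this box lend themselves to a simple sequence scan. The following Python sketch locates the most common poly(A) signal hexamer, AAUAAA, in an RNA string. The function name `find_polya_signals` and the example sequence are hypothetical, chosen only for illustration; a realistic poly(A) site predictor would also need to consider variant hexamers, the downstream U/GU-rich element and auxiliary elements, none of which are modeled here.

```python
def find_polya_signals(rna, signal="AAUAAA"):
    """Return 0-based start positions of every occurrence of `signal` in `rna`.

    DNA-style input (with T) is accepted and converted to RNA (with U).
    """
    rna = rna.upper().replace("T", "U")
    positions = []
    start = rna.find(signal)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are not missed.
        start = rna.find(signal, start + 1)
    return positions


# Hypothetical 3' UTR fragment: one AAUAAA hexamer followed by a GU-rich stretch.
example = "GGCAAUAAAGCUUGCAUGUGUUUGG"
print(find_polya_signals(example))
```

Reporting positions rather than a yes/no answer makes it straightforward to compare the locations of tandem poly(A) signals within a single 3′ UTR.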
Current interest relating to RNA complexity has three main aspects: meeting methodological challenges so that the vast amount of information present in RNA can be collated; analysis of these data sets so that new rules of RNA regulation can be detailed; and application of the new insights, to achieve a basic understanding of cellular control and, ultimately, an understanding of gene dysregulation in human disease. This review will discuss each of these points - methodology, RNA analysis and, more briefly, its biological manifestations - in each case focusing on the control of RNA complexity. Although this review touches on many aspects of RNA function, including links to transcriptional and translational regulation, space does not allow a full discussion of these issues, which are covered in several excellent reviews19,24,36,41,47–50.
Although advances in the 1970s–80s came about through the detailed study of individual RNAs, the focus of recent technological advances is the characterization of whole RNA populations in cellular contexts with nucleotide level resolution. Accordingly, new methods able to simultaneously analyze multiple RNA processing events are culminating in the development of genome-wide RNA maps that pave the way to new biological insights.
Systematic efforts to identify RNA variants began with microarray technologies. A variety of different arrays have been used to elucidate RNA complexity. In particular, probesets for alternative exons identified from genome sequencing efforts have been used to analyze splice variants. The first use of these exon-junction microarrays to interrogate RNA populations from different tissues led to the recognition that a large number (at the time, the estimate was ~75%, but see below) of human multi-exon genes are alternatively spliced51. Similar exon-junction arrays have been used to identify tissue-restricted patterns of alternative mRNA expression and provide insights into their regulation by specific RNABPs52,53.
The ability of microarrays to provide valuable data on alternative RNA processing has led to their productive use as tools to assess mRNA diversity in different biologic contexts. Nonetheless, microarray studies have been limited by several factors. Two of these factors - the incomplete nature of gene annotations and limitations on microarray density - continually improve over time, but others, such as the need to predefine targets (such as alternative exons), preclude the identification of novel alternative mRNA isoforms. One effort to address this latter issue has been the development of microarrays that can be used to interrogate “complete” sets of transcribed exons54. Although these arrays do not monitor specific splice junctions, they have the advantage of expanded transcriptome coverage, which provides more reliable estimates of RNA abundance, and they are able to detect changes in the usage of individual exons (alternatively spliced isoforms) as well as variants derived from differential transcription regulation or alternative polyadenylation. More complete “genome tiling” arrays have been developed for yeast, Drosophila, and some human chromosomes55,56; these arrays circumvent the need for prior knowledge of the transcriptome. Analyses using tiling arrays reported that the vast majority of the human genome is transcribed57, although the biological relevance of these findings remains uncertain56,58. A final limitation of microarrays is that they depend on nucleic acid hybridization; researchers need to consider signal-to-noise ratios that can vary owing to the differences in base composition and annealing properties between individual probes. These limitations are being addressed with a new technology, direct high-throughput sequencing.
RNA-Seq (or next-generation RNA sequencing) (Box 2) takes advantage of the power of new single molecule sequencing methods59,60 that are currently able to produce billions of nucleotides of sequence in a matter of days for several thousand dollars. The power of RNA-Seq to assess mRNA complexity was highlighted in 2008 by the Blencowe61 and Burge45 laboratories, who provided complete RNA profiles and analysis of alternative splicing and polyadenylation variants in different tissues that easily rivaled those that could be obtained using microarrays. The ability of RNA-Seq to detect previously uncharacterized mRNA isoforms and new classes of non-coding RNAs62 illustrates the utility of this rapidly evolving technology, which is assuming an increasingly dominant role in RNA analyses. In addition, high throughput sequencing can be coupled with hybridization strategies to enrich specific RNA populations prior to sequencing. This balances constraints of hybridization technologies (as with microarrays) with the advantages of high throughput sequencing experiments, and has been effectively used to study RNA variants generated by RNA editing63,64.
The term RNA-Seq applies to any of several different high throughput (next-generation) sequencing methods to obtain transcriptome-wide RNA profiles59. Typically, RNA from two samples that are to be compared is sheared, converted to cDNA, and sequenced. This can yield, for example, up to 25 million sequence reads that are ~35 nt in length. Although there can be sequencing bias at any particular position in the genome - for example, depending on GC-content and/or the propensity of that sequence to be amplified by PCR - such biases will be the same across different samples. Therefore, differences between samples can be quantitated at the resolution of individual splice variants45 or even edited RNA nucleotides63. Other applications of RNA-Seq, using different sequencing strategies, include looking at pools of RNA that are being translated by sequencing RNA bound to ribosomes159 and single cell RNA analysis162. Currently 2.5 × 10⁷ sequence reads can detect 2.5 × 10⁵ different transcripts. This means that abundant transcripts are represented by many reads and rare transcripts by only a few; the sensitivity of this technique is likely to improve over time.
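The read-depth figures quoted in this box can be turned into a rough back-of-the-envelope calculation. The Python sketch below computes the average number of reads per detected transcript and the expected read count for a transcript of a given relative abundance. It assumes, as a simplification not made in the text, that reads are sampled strictly in proportion to abundance and that transcript length and sequence bias can be ignored; the function names and the example abundance values are hypothetical.

```python
TOTAL_READS = 2.5e7   # sequence reads per experiment, as quoted in the text
TRANSCRIPTS = 2.5e5   # distinct transcripts detected, as quoted in the text


def mean_reads_per_transcript(total_reads, n_transcripts):
    """Average coverage if reads were spread evenly over all transcripts."""
    return total_reads / n_transcripts


def expected_reads(total_reads, fraction_of_pool):
    """Expected reads for a transcript that makes up `fraction_of_pool` of the
    RNA population, assuming sampling proportional to abundance."""
    if not 0.0 <= fraction_of_pool <= 1.0:
        raise ValueError("fraction_of_pool must lie between 0 and 1")
    return total_reads * fraction_of_pool


print(mean_reads_per_transcript(TOTAL_READS, TRANSCRIPTS))
# Hypothetical abundant and rare transcripts (fractions chosen for illustration).
print(expected_reads(TOTAL_READS, 1e-3))
print(expected_reads(TOTAL_READS, 1e-7))
```

Under these assumptions the average is about 100 reads per transcript, but a transcript at one part in ten million would be expected to yield only a few reads, which illustrates why rare isoforms sit close to the detection limit.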
Bioinformatics has emerged as a powerful complement to current efforts to analyze the complexity of cell-specific RNA signatures. Sequence-based bioinformatic approaches have long been applied to the study of pre-mRNA processing and have revealed consensus sequences that define the 5′ splice site65, the poly(A) signal that is necessary for 3′ end maturation and termination of transcription66,67, and atypical consensus sequences that define an entire alternative means for regulating splicing68. Current bioinformatic efforts are aided by, and are also dependent on, improvements in the number and depth of sequences available from EST and cDNA libraries, microarray datasets and whole genome sequencing. Therefore bioinformatics is likely to become more powerful as new technology improves such databases.
Comparisons of RNA profiles from different cell types and organisms have helped to determine the frequency of alternative processing and the extent to which it is subject to species or tissue-specific regulation. In addition, analyses of sequences associated with conserved alternative processing events have helped to develop understanding of a number of aspects of alternative processing, including: identifying sequence elements that are potentially associated with the regulation of alternative processing52,69–72; investigating the origins of alternative splicing73; and defining unexpected features, such as ultraconserved elements that mediate nonsense-mediated decay of transcripts that encode RNABPs74. Although not the focus of this review, bioinformatics has also been used in efforts to identify miRNA targets47 and other regulatory elements in 3′ UTRs38,39,40.
Bioinformatic, microarray, and high throughput sequencing studies have provided an unprecedented ability to describe RNAs on a genome-wide scale and to suggest which cis elements and trans-acting factors are associated with their regulation. However, these methods are limited without biochemical methods to identify the direct RNABP-RNA interactions that define that regulation in vivo. In general, researchers wish to distinguish between the primary (direct) and secondary (indirect) effects of RNA regulatory factors. For example, in the Fragile-X mental retardation syndrome, the loss of FMRP function is clearly the proximate cause of the disorder. Therefore there is great interest in identifying the RNAs that FMRP regulates in neurons and in distinguishing these direct effects from RNA dysregulation that is owing to secondary or tertiary consequences of FMRP loss75. Put another way, any perturbation in a cell is likely to disrupt the RNA profile of that cell, as detected by methods such as microarray or RNA-Seq. Therefore, changes in RNA profiles are expected after perturbation of RNABPs, and such changes cannot be taken as evidence of specific action of an RNABP. Attempts to study mechanisms of RNA regulation in cells depend on distinguishing direct from indirect consequences of cellular manipulations.
Multiple approaches have emerged for the biochemical identification of functional RNABP-RNA interactions in vivo. These include immunoprecipitation of RNABPs followed by purification of the co-precipitating RNA and analysis by RT-PCR or microarray analysis76,77. These strategies have proven useful, but they cannot discriminate direct from indirect interactions, nor identify RNA-protein binding sites. Moreover, they are limited by the need to use relatively low stringency conditions to maintain protein-RNA interactions, and such conditions are associated with problems related to signal-to-noise ratio, co-precipitating RNABPs and RNABP-RNA re-association in vitro78–80.
An alternative means of identifying regulatory RNABP-RNA interactions is the CLIP (cross-linking and immunoprecipitation) assay78,81,82 (Box 3). CLIP applies the observation - first made in the study of tRNA-protein interactions in the 1970s83 and even earlier for DNA-protein interactions - that UV-irradiation causes covalent cross-linking between RNA-protein complexes that are in tight apposition (that is, within ~Ångstrom distances). UV cross-linking was applied in a cellular context in studies of protein-RNA interactions by van Venrooij84 and Pederson47,85,86 in the 1980s, and was then refined by immunoprecipitation of cross-linked hnRNP-RNA complexes by Dreyfuss87 and colleagues.
CLIP takes advantage of the ability of UV-irradiation to penetrate intact cells or tissues and induce covalent crosslinks between RNA and proteins that are in direct contact (~1 Ångstrom apart). A flow diagram of the experimental steps is shown in the figure. Once they have been covalently bound, RNA-protein complexes can be purified under harsh conditions, which gives the advantage of being able to separate them from closely bound RNABP-RNABP complexes, reassociated RNAs, and background RNA. After purification, the CLIP method81,82,92 utilizes proteinase K to remove the RNABP. This is followed by linker ligation and RT-PCR to analyze the RNA sequences. This sequencing analysis can be done using high throughput sequencing methods, in which case it is referred to as “HITS-CLIP” 88,91. The details of HITS-CLIP are likely to be modified and improved over time. For example, more efficient sequencing and the use of ever-smaller sample sizes are likely to be possible. Current methods and algorithms for analyzing HITS-CLIP data can be found at www.rockefeller.edu/labheads/darnellr/. It should be noted that it remains to be determined whether HITS-CLIP has limitations in terms of efficiency of crosslinking specific subsets of RNA-protein interactions. However, to date, microarray and HITS-CLIP studies have yielded very similar results53,88, which suggests that crosslinking can be highly efficient across the transcriptome.
CLIP allows RNABP-RNA interactions that are occurring in live cells, or even whole tissues such as the brain, to be covalently “locked” in place and rigorously purified. This yields a population of RNA sequences that are directly bound by the RNABP of interest. Sequencing of this population provides a means of identifying the bound RNA and, importantly, the position of protein binding. The CLIP method (and emerging methodological improvements to this method 81,82,88–91) established that small crosslinked RNA fragments could be amplified by RT-PCR after partial RNase and proteinase K digestion. This approach - initially using conventional strategies92 and more recently using high throughput sequencing (HITS-CLIP) 88 - reveals the RNA “sequence footprints” that are bound by RNABPs and provides a powerful way to study RNA-protein interactions in living tissues on a transcriptome-wide level1,41,59. So far, HITS-CLIP has been used to generate high resolution genome-wide assessments of RNABP-RNA interactions in mouse brain88, stem cells90, and tissue-culture cells93, and to deconvolute Argonaute-miRNA-mRNA ternary interactions in the mouse brain91 (discussed further below).
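One way to picture how sequenced CLIP fragments become “sequence footprints” is to group overlapping reads that map to the same transcript. The Python sketch below merges overlapping read intervals into clusters and counts the reads in each. This is a generic interval-merging illustration, not the published HITS-CLIP analysis pipeline; the function name `cluster_reads`, the half-open coordinate convention and the example coordinates are all assumptions made for the example.

```python
def cluster_reads(reads):
    """Merge overlapping (start, end) read intervals on a single transcript.

    Intervals are half-open, [start, end). Returns a list of
    (cluster_start, cluster_end, n_reads) tuples sorted by start position.
    """
    clusters = []
    for start, end in sorted(reads):
        if clusters and start < clusters[-1][1]:
            # Read overlaps the current cluster: extend it and count the read.
            prev_start, prev_end, n_reads = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), n_reads + 1)
        else:
            # No overlap: this read opens a new cluster.
            clusters.append((start, end, 1))
    return clusters


# Hypothetical mapped read coordinates on one transcript.
reads = [(100, 135), (110, 145), (120, 150), (400, 435)]
for start, end, n_reads in cluster_reads(reads):
    print(start, end, n_reads)
```

In this toy example the three overlapping reads collapse into one candidate footprint and the isolated read forms a cluster of its own; a real analysis would additionally need to weigh read counts against transcript abundance and reproducibility across replicates.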
As new methods have improved the ability to assess mRNA complexity, estimates of the extent to which alternative RNA isoforms contribute to functional diversity have increased. Recent efforts to characterize the mRNA signature of different human tissues using RNA-Seq have revealed that nearly all multi-exon human genes (comprising >90% of all genes) generate alternative mRNA isoforms, and most do so in a tissue-specific manner45,61. These alternative isoforms include variants that arise from alternative transcription initiation and from all known forms of alternative pre-mRNA processing. In addition, high throughput sequencing combined with target enrichment has been used to assess the diversity generated by over 36,000 sites at which RNA editing occurs63. All of these means of modifying RNA transcripts generate complexity of both protein coding mRNAs and ncRNAs; we focus here on RNA as the regulated substrate (as opposed to DNA as the substrate, which is reviewed elsewhere24,26,41), and note that, to date, most experimental validation has been done with protein coding mRNAs.
Alternative splicing is one of the major ways in which RNA diversity is generated. Comparative analyses of splicing variants are yielding insights into their biological consequences76,94–97. Interestingly, RNA-Seq-based characterization of tissue transcriptomes, together with microarray analyses52,54,69 and comparative bioinformatic studies71,98, identified the mammalian brain as the tissue expressing the greatest number of alternative mRNA isoforms. This is likely to be related to the fact that this tissue is populated by thousands of highly specialized unique cell types that undergo dynamic changes. In the nervous system, alternative splicing has many important roles including controlling the spatial and temporal expression of isoforms necessary for neurodevelopment and modification of synaptic strength95,99.
Important general issues regarding the complexity of alternative splicing are highlighted by contrasting studies of Dscam (Down Syndrome cell adhesion molecule) and neurexin splicing in the nervous system. In Drosophila, Dscam - which is believed to be crucial for proper neural circuit formation - encodes many thousands of neuron-specific RNA variants that are produced by alternative splicing100. Despite the great complexity of RNA products, and the recognition that RNABPs act to restrict Dscam exon usage101, it is believed that the choice of RNA variants produced in any one neuron is largely stochastic, and the resulting biologic complexity is proportionately low. Each RNA variant encodes a cell surface axonal molecule that is randomly generated to be different from that on neighboring axons, thereby yielding a unitary outcome - that is, the inability to fasciculate with neighbors100.
Regulated, rather than stochastic, production of alternative RNA variants has the potential to generate a great diversity of biological function. Alternative splicing of neurexin pre-mRNA in mammalian brain provides an interesting example. Nearly 3000 unique neurexin transcripts are derived from the combination of three genes, each of which has two alternate promoters and encodes transcripts with ~10 alternate exons102. This set of alternate transcripts encodes variants that give rise to alternative neurexin protein isoforms, which have different interactions with different neuroligin protein isoforms across the synaptic cleft. This suggests that a ‘splice code’ might underlie trans-synaptic cell adhesion103. There is evidence that a small number of RNABPs might regulate neurexin (and neuroligin) isoforms95, which may in turn generate diverse biological outcomes103. These observations underscore the more general point that alternative splicing plays a major role in biological complexity28.
Although it is clear that alternative processing of pre-mRNA can confer different structural and functional properties to proteins76,104, additional functional roles for alternative processing in the regulation of gene expression have also emerged. Consistent with EST-based bioinformatic studies46, RNA-Seq analysis identified tissue-specific biases in the regulation of tandem polyadenylation sites (Figure 2A, Box 1). Unlike the alternative poly(A) site regulation that is coupled to inclusion of an alternative 3′ terminal exon, alternative polyadenylation at tandem poly(A) sites can yield transcripts with identical protein-coding sequences but with different 3′ UTR sequences. This provides the potential for differential regulation of mRNA expression by RNABPs and/or miRNAs (Figure 2A). Exon microarray and RNA expression studies have indicated that such regulation might have important biologic consequences. Proliferating cells—T lymphocytes105 and tumor cells106—harbor shortened 3′ UTRs. By contrast, the brain - a non-proliferative tissue - appears to regulate polyadenylation so that transcripts harbor, on average, longer 3′ UTRs45,88. These studies suggest that these differing cell types regulate polyadenylation in opposite ways to allow RNA to escape from, or be subjected to, different levels of regulation. There are likely to be multiple mechanisms of regulation, including miRNA-mediated regulation of translation105,106, RNA localization38 and stability39,40.
A recently recognized example in which alternative processing is coupled to post-transcriptional control is that of alternative splicing events that result in the introduction of a premature termination codon (PTC), which targets mRNA for degradation by nonsense-mediated mRNA decay (NMD)24,48,107 (Figure 2B). Although EST-based bioinformatic studies108,109 had suggested that alternative splicing coupled to NMD (AS-NMD) is a widely used mechanism for controlling RNA abundance, how widespread it is remains unclear. However, AS-NMD has been shown to regulate the expression of many splicing regulatory factors (some in an auto-regulatory manner), including the SR proteins74, hnRNP proteins110–112, and core spliceosomal proteins109. Interestingly, some of the exons whose splicing or skipping results in a PTC are associated with ultraconserved elements74,113. This suggests that AS-NMD might be an evolutionarily ancient mechanism that is used to establish the correct balance of nuclear RNABPs that is necessary to generate cell-type and developmental-stage specific mRNA profiles110,111.
The dependence of pre-mRNA processing events on multiple RNABP-RNA interactions provides multiple steps at which processing can be regulated. Core elements might be directly involved in the regulation of exon usage, through regulation of core factor stoichiometry29,114. Although both SR proteins and hnRNP proteins are widely expressed, changes in their stoichiometry can mediate tissue-specific differences in alternative splicing115–117. Additionally, the activity of RNABPs can be regulated by post-translational mechanisms, including phosphorylation and subcellular sequestration in response to cellular or metabolic stress41,118. Such mechanisms can convert a general splicing repressor to a sequence-specific splicing activator119. Therefore it is not sufficient to rely solely on correlative expression data to build models of RNA regulation in vivo.
An additional layer of pre-mRNA regulation is imparted by tissue-specific RNABPs. Multiple examples of highly related factors with non-overlapping patterns of expression have been described, including the Nova proteins (Nova1 and Nova2)78, the polypyrimidine tract binding proteins120–122, Elavl (Hu) proteins123, Fox proteins72,90, and the CELF and MBNL proteins97,124. Although many homologous tissue-restricted factors show high levels of conservation, multiple mechanisms provide each homologue with a unique pattern of expression, which suggests the homologues have distinct functional roles. For example, cross-regulation at the RNA level ensures that Ptbp1 and Ptbp2 have mutually exclusive expression patterns in mouse and human cells, and this is believed to be critical for the regulation of neuronal differentiation110,111,122. In general, it is anticipated that the relative amounts of different positive- and negative-acting RNABPs might define a “cellular RNA processing code” that dictates the pattern of processing for each pre-mRNA, so that pre-mRNAs with the same set of regulatory elements can be regulated in a coordinate manner96,125–127. As detailed below, the application of new methodologies is advancing these concepts in expected and unexpected ways, and is revealing details of the mechanisms - including cis and trans acting codes - that underlie the establishment and regulation of cell-specific RNA profiles.
Changes in the expression of numerous RNA-regulatory proteins are coincident with changes in tissue and developmental mRNA profiles97,128. A challenge for the future is to understand how the expression and activity of these regulatory factors are regulated, and how multiple factors in combination control the fate of transcribed RNA. Computational analyses have shown that alternative processing events are associated with highly conserved sequences and have identified elements that are enriched near regulated processing sites and are therefore likely to be functionally important for protein binding and regulation45,52,69,71,72,88,95,97,129–132. However, only some of the enriched elements correspond to sequences that have been demonstrated to be bound by specific RNABPs, and in most cases in vivo studies have not yet been performed to test the functional significance of suspected RNABP-RNA interactions. Interestingly, highly conserved intronic sequences that are associated with alternative splicing events are large enough to accommodate many RNABP-RNA interactions, which is consistent with the idea of combinatorial control involving multiple RNABPs52,70,71.
Recently, the complexity of RNABP action has begun to be addressed by combining genetic models with high throughput biochemical, bioinformatic, and RNA profiling methods. Such studies have been facilitated by the development of experimental systems with genetically modified RNABP expression: mouse knockouts53,88,133–135, transgenic mice97, morpholino-treated zebrafish embryos136, or cultured cells in which the expression of specific RNABPs has been knocked down by siRNA69,72,90,136–139. The use of high-throughput methods in conjunction with these models is now allowing the identification and functional validation of RNA-protein interactions on a transcriptome-wide scale.
The generation of transcriptome-wide maps of functional RNABP-RNA interactions is providing insights into the rules by which RNA complexity is regulated. For example, these studies have generated compelling evidence that the position of RNABP-RNA interactions within primary transcripts dictates the functional outcome of alternative pre-mRNA processing events (Figure 3). Initial ideas relating to Nova-mediated RNA regulation in mouse brain were provided by detailed studies of two transcripts studied in vitro and in tissue culture cells134,140,141. Subsequently, a combination of studies in Nova knockout mice142, including exon junction arrays53, bioinformatics69,134 and HITS-CLIP88, expanded these ideas into a general rule. In this work, and in subsequent studies of the Fox1/2 splicing factor with analogous findings69,72,90, it was shown that binding of RNABPs within an alternative exon or the flanking upstream intronic sequence is generally associated with exon skipping, whereas binding of RNABPs to the downstream intronic sequence is generally associated with exon inclusion (Figure 3).
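The position-dependent rule described above can be written down as a small lookup. The Python sketch below classifies a binding site relative to an alternative exon and returns the outcome that the rule generally predicts. It is a deliberately simplified illustration: the function names, the half-open coordinate convention and the assumption that coordinates increase in the 5′ to 3′ direction of the transcript are hypothetical, and the rule itself is described in the text as a general tendency rather than an absolute.

```python
SKIPPING = "exon skipping"
INCLUSION = "exon inclusion"


def classify_region(site, exon_start, exon_end):
    """Locate a binding site relative to an alternative exon [exon_start, exon_end).

    Coordinates are assumed to increase in the 5'-to-3' direction of the transcript.
    """
    if exon_start >= exon_end:
        raise ValueError("exon_start must be smaller than exon_end")
    if site < exon_start:
        return "upstream intron"
    if site < exon_end:
        return "exon"
    return "downstream intron"


def predicted_outcome(site, exon_start, exon_end):
    """Apply the rule from the text: exonic or upstream intronic binding is
    generally associated with skipping, downstream intronic binding with inclusion."""
    region = classify_region(site, exon_start, exon_end)
    return INCLUSION if region == "downstream intron" else SKIPPING


# Hypothetical alternative exon spanning positions 1000-1099.
for site in (950, 1050, 1180):
    print(site, classify_region(site, 1000, 1100), predicted_outcome(site, 1000, 1100))
```

A fuller model would also need to account for the distance of the binding site from the splice sites and for the combined action of several RNABPs on the same exon.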
The extent to which such position-dependent regulation is a feature of other RNABPs is not currently known. However, bioinformatic and biochemical studies of other RNABPs, including Mbnl, Celf, Ptbp169,97 and several hnRNP proteins (A/B, L, LL, F and H)41,138, give reason to believe that position-dependent interactions with target pre-mRNAs may prove to be a general feature of RNABP regulation. The application of genetic systems and high throughput approaches to identify transcriptome-wide interactions and assess their functional significance will provide a greater understanding of the mechanisms by which RNABPs act, in isolation and combinatorially, to regulate gene expression.
Mapping functional transcriptome-wide RNABP-RNA interactions in an unbiased manner can reveal unanticipated functions for RNABPs in generating RNA diversity and regulation. For example, HITS-CLIP combined with microarray analysis of wild type and Nova2 knockout mouse brain led to the identification of an unexpected role for Nova2 in regulating alternative polyadenylation in the brain88. Such studies illustrate a point previously recognized, albeit not on a transcriptome-wide scale, for SR and hnRNP proteins: that RNABPs cannot be neatly allocated to a single functional category; rather, they are multifunctional proteins that participate in many aspects of RNA biochemistry. HITS-CLIP analysis of the SR protein Sfrs1 (previously known as Asf/Sf2) in human embryonic kidney cells revealed an over-representation of binding to mRNAs encoding RNA regulatory proteins, which suggested the possibility of a regulatory loop93. Another new aspect of RNABP regulation emerged from CLIP analyses of hnRNP-A1. These studies revealed that hnRNP-A1 binds to the stem-loop sequences in the miRNA precursor pri-miR-18a143 in HeLa cells, and in so doing functions as an auxiliary factor to enhance Drosha-mediated processing to mature miR-18a144.
Recently, HITS-CLIP was extended to the study of ternary interactions between an RNABP (an Argonaute protein; Ago), RNA and miRNAs91. These studies developed a genome-wide map of miRNA binding sites in mouse brain transcripts. Such studies offer a means to resolve the difficulty bioinformatic approaches have had in identifying bona fide miRNA seed sites. In addition, they might also yield new rules of RNA regulation—27% of Ago binding sites appeared to be “orphans” in which no miRNA binding site could be identified. Therefore there might be new rules of miRNA-mRNA interactions that are yet to be elucidated.
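As a rough illustration of what classifying an Ago footprint as an "orphan" could involve, the following Python sketch searches a footprint for a canonical 6-mer seed match (complementarity to miRNA nucleotides 2-7). This is a minimal sketch under stated assumptions: the function names, the miRNA label and the example sequences are hypothetical inputs, and the cited study may have used different or more permissive matching criteria.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str) -> str:
    """Return the mRNA 6-mer complementary to miRNA nucleotides 2-7."""
    seed = mirna[1:7]  # nucleotides 2-7, written 5' to 3'
    # Reverse complement gives the target site as read 5' to 3' on the mRNA.
    return seed.translate(COMPLEMENT)[::-1]


def classify_footprint(footprint: str, mirnas: dict) -> list:
    """List the miRNAs whose 6-mer seed site occurs in an Ago footprint.

    An empty list marks the footprint as an "orphan" under this
    simplified criterion only.
    """
    return [name for name, seq in mirnas.items() if seed_site(seq) in footprint]


# Illustrative inputs (hypothetical label and footprint sequences).
mirnas = {"miR-example": "UAAGGCACGCGGUGAAUGCC"}
print(seed_site(mirnas["miR-example"]))            # UGCCUU
print(classify_footprint("ACGUGCCUUAGC", mirnas))  # ['miR-example']
print(classify_footprint("ACGACGACGACG", mirnas))  # []
```

Because only perfect 6-mer matches are considered, footprints bound through non-canonical pairing would be labeled orphans here, which is precisely the class of sites that might point to rules not yet elucidated.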
Prior to the onset of high throughput methods, a number of observations suggested that some level of biological coherence is established by RNA regulation, that is, the idea that coordinate regulation of RNAs encoding related proteins coordinates biological processes. Observations of biological coherence of RNA regulation during sex determination in Drosophila and iron-response pathways in vertebrate cells in the 1990s were followed by more general hypotheses of functionally coherent networks in yeast, tissue culture cells and mouse brain, as recently discussed95,127,145. However, the inability to distinguish directly from indirectly regulated RNAs complicated evaluation of such networks.
Now, the combination of genetic systems, bioinformatics and biochemistry can be used to uncover functional roles and networks of RNABPs by rigorously identifying validated sets of transcripts and the biological functions of the encoded proteins. For example, analysis of RNA from wild type and Nova knockout mouse brains on exon junction microarrays53 revealed that Nova regulates alternative splicing of a biologically coherent set of transcripts encoding proteins with synaptic functions53,88,95. HITS-CLIP and bioinformatic studies88 showed that a subset of these transcripts were directly regulated by Nova. This network has been able to predict aspects of Nova physiology in the mouse brain, including roles in inhibitory potentiation in the hippocampus146 and in motor neuron function147. Taken together, these studies provided the first demonstration in mammals of the coordinated activity of an RNABP in a biological network. Similarly, analyses of RNA regulatory defects in mouse knockouts of Sfrs1135,148, Srp38133 and the Celf and Mbnl proteins97,149 are poised to reveal the direct roles that different factors have in generating the specific alternative mRNA isoforms that are necessary for proper tissue development or functionality.
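The logic of separating directly from indirectly regulated transcripts, as described above, can be sketched as a simple set comparison between transcripts whose processing changes in a knockout and transcripts with crosslinked binding sites. The function name and the transcript identifiers below are hypothetical, and real analyses would also weigh binding position, tag density and statistical significance.

```python
def partition_targets(changed_in_knockout: set, bound_in_clip: set):
    """Split transcripts into candidate direct, indirect and bound-only sets.

    changed_in_knockout: transcripts with altered isoform profiles in the
        RNABP knockout (for example, from exon junction arrays).
    bound_in_clip: transcripts with RNABP binding sites (for example,
        from HITS-CLIP).
    """
    direct = changed_in_knockout & bound_in_clip           # changed and bound
    indirect = changed_in_knockout - bound_in_clip         # changed, not bound
    bound_unchanged = bound_in_clip - changed_in_knockout  # bound, no change seen
    return direct, indirect, bound_unchanged


# Hypothetical transcript identifiers, for illustration only.
changed = {"txA", "txB", "txC"}
bound = {"txB", "txC", "txD"}
direct, indirect, bound_unchanged = partition_targets(changed, bound)
print(sorted(direct), sorted(indirect), sorted(bound_unchanged))
```

Membership in the "direct" set is only a starting hypothesis in this sketch; functional validation of each interaction is still required.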
The importance of methods to probe mRNA complexity and understand its regulation is underscored by the growing list of human diseases that are associated with defects in the expression of alternative mRNA isoforms27,94. This list includes diseases that result from mutations that activate cryptic splice sites or disrupt sequences that are necessary for RNA processing, which lead to the alteration of specific protein isoforms or transcript destabilization. There is also a growing list of disorders that show changes in RNABP expression and/or activity owing to mutation, autoimmune targeting or sequestration of RNABPs. Such disorders seem to particularly affect complex tissues, and are exemplified by neurodegenerative disorders. RNABPs that have been linked to neurodegeneration include: FUS and TDP-43, which are mutated in patients with familial ALS150; Nova and the Elavl (Hu) proteins, which are targeted by the immune system in paraneoplastic neurodegenerative disorders78,151; SMN1, which is mutated in spinal muscular atrophy27; IGHMBP2, which is mutated in spinal muscular atrophy and respiratory distress152; senataxin, which is mutated in ALS4153; and glycyl tRNA synthetase, which is mutated in hereditary motor neuronopathy type V154. Moreover, a growing number of neurological disorders are believed to be linked to RNA expansions that sequester RNABPs, as exemplified by the sequestration of MBNL by CUG repeats in myotonic dystrophy149. Similarly, the deletion, mutation or inappropriate expression of miRNAs, which leads to mistargeting of the RNABP Ago and to aberrant RNA regulation17, is important in multiple disorders27, including neurologic disease, cancer, and autoimmunity. Although we are just beginning to appreciate the role of RNABPs in human disease, methods that allow researchers to overlay RNA sequence profiles and RNABP maps offer a new means of comparing protein-RNA interactions in normal and diseased tissues.
There are also many examples of defects in the expression of alternative mRNA isoforms and RNABPs in disease for which a defined causal relationship has not been shown. For example, microarray and high throughput RT-PCR analyses have detected alternative splicing events associated with different types of cancer and have identified ‘splicing signatures’ associated with different histologically defined tumor subgroups155. It seems likely that the expression of aberrantly spliced transcripts will be found to contribute to tumor biology. Efforts to identify alternative splicing markers associated with disease, combined with bioinformatic analyses, are providing insights into the mechanisms of RNA regulation that, when perturbed, might result in disease. For example, consensus binding sites for the Fox1/2 RNABPs were identified near many alternative exons that were mis-spliced in ovarian and breast cancer156. Evidence suggesting that the Fox proteins directly regulate these alternative splicing events includes decreased levels of Fox2 in ovarian cancer and the recapitulation of cancer-associated splicing defects by knock-down of Fox2 expression in cultured cells. Such efforts are providing new insights into the extent to which alternative mRNA isoforms correlate with, and in some cases cause, disease and how disruption of RNABPs that have tumor suppressor156 or proto-oncogene157 activities might lead to the aberrant mRNA processing events that are associated with cancer. Considering the many ways in which alternative processing can affect gene expression, the ability to characterize RNA profiles and regulation in disease will likely play a major role in advancing our understanding of the biology of disease and assist in the development of strategies for therapeutic intervention.
Methodological advances in the 20th century led to the realization that RNA complexity and its regulation lie at the core of biologic complexity. In recent years, the advent of high throughput strategies has enabled nucleotide level analyses of RNA regulation and complexity on a genome-wide scale, which have revealed insights into the extent to which mRNA diversification contributes to cell-specific biology, and the mechanisms by which this diversification is achieved. A challenge for the future will be to determine the extent to which different RNA isoforms contribute to biological complexity.
The complementary methodological approaches that are described in this review each give powerful but incomplete data about RNA regulation: methods to enumerate RNA variants (microarrays and RNA-Seq) and bioinformatic approaches are correlative, and biochemical crosslinking alone does not yield functional data. Importantly, combining these efforts (Figure 3) offers the opportunity to identify and experimentally investigate different types of RNA regulatory mechanisms. Such studies have revealed that RNABPs regulate biologically coherent RNA networks, and unanticipated mechanisms by which they do so are emerging. The variety of interactions that are evident from genome-wide studies of RNABPs emphasizes that they are multifunctional proteins whose activities depend on affinity constants and local concentrations of proteins and their RNA substrates. Therefore an important consideration for the future will be how RNABPs act in the context of their local environment (nuclear compartments, cytoplasmic P-bodies, stress granules, dendrites, and so on) and the impact that accessibility of RNA targets has on RNABP activity.
Another challenge will be to take individual RNA maps - each based on genetics, bioinformatics and genome-wide biochemistry - and superimpose them to give a more complete picture of how regulation works inside a cell, in which hundreds of RNABPs simultaneously compete to regulate thousands of RNAs. Such pictures will be needed to interpret the dynamics of RNA-protein interactions during biological processes158. Analysis of RNA-protein regulatory maps is also likely to yield insight into non-coding RNAs and their roles in coordinating gene regulation. Finally, application of the methods and concepts reviewed here will advance our understanding of other RNA regulatory mechanisms. For example, translational control is beginning to be studied by using high-throughput methods: yeast translation was recently studied by using RNA-Seq159 to characterize polyribosomal RNA, and mouse genetics coupled to microarray profiles160,161 was used to profile translated mRNAs within individual neuronal subtypes. Combining the methods described in this review with single cell and, ultimately, subcellular analysis will offer the opportunity to understand RNA function in a variety of cellular contexts. Such studies will enhance discovery of how RNA regulation impacts tissue complexity and disease by shaping the expression of genetic information.
We are grateful to members of the laboratory for thoughtful discussions. We apologize to the many colleagues whose interesting studies we reviewed but were unable to cite here due to space limitations. This work was supported by grants from the NIH and the Howard Hughes Medical Institute.
Donny D. Licatalosi
Dr. Licatalosi received his Ph.D. from the Department of Biochemistry and Molecular Genetics at the University of Colorado Health Science Center, Denver, USA, and is currently a postdoctoral associate in Dr. Robert Darnell’s lab at the Rockefeller University, New York, USA. His research interests include understanding RNA-based mechanisms that control the expression of genetic information.
Robert B. Darnell
Dr. Darnell received his undergraduate degree in Biology and Chemistry at Columbia University, New York, USA, and his MD and PhD in molecular biology from Washington University School of Medicine, St. Louis, USA. After finishing clinical training as Chief Neurology Resident at Cornell, Ithaca, USA, and Memorial Sloan-Kettering Cancer Center, New York, USA, he joined the Rockefeller University in 1992, where he is the Heilbrunn Cancer Professor, Senior Physician, and an Investigator of the Howard Hughes Medical Institute. He pioneered studies of the paraneoplastic neurologic disorders and his work led to the discovery of several genes encoding neuron-specific RNA binding proteins, which he is studying in the brain with innovative methods.