The advent of high-throughput technologies including deep-sequencing and protein mass spectrometry is facilitating the acquisition of large and precise data sets towards the definition of post-transcriptional regulatory networks. While early studies that investigated specific RNA-protein interactions in isolation laid the foundation for our understanding of the existence of molecular machines to assemble and process RNAs, there is a more recent appreciation of the importance of individual RNA-protein interactions that contribute to post-transcriptional gene regulation. The multitude of RNA-binding proteins (RBPs) and their many RNA targets has only been captured experimentally in recent times. In this review, we will examine current multidisciplinary approaches towards elucidating RNA-protein networks and their regulation.
The challenges of understanding post-transcriptional gene regulation are manifold. They include cataloging RBPs and their respective RNA binding sites, interpreting cell-type-specific differences in RBP and target RNA isoforms, and attributing a regulatory or biochemical function to the interactions. Since RNA is a moving target inside cells, its subcellular localization and that of its interacting proteins are relevant to their encoded function. Upon transcription of a single gene, tens to thousands of RNA copies are synthesized and undergo a series of maturation steps that occur in different locations within the cell. Therefore, RNA-based regulation depends on the concentration of the target RNA and its potential regulators, in contrast to transcriptional regulation, where the number of targeting sites on DNA is to a first approximation independent of the expression program of the cell (Figure 1).
Messenger RNAs (mRNAs) are subject to various maturation steps that involve protein binding and result in the formation of mRNA ribonucleoprotein particles (mRNPs). The compositions of these mRNPs remain poorly defined, although they are known to control constitutive and alternative mRNA splicing, 5′ 7-methylguanosine capping and 3′ polyadenylation, export out of the nucleus, and ultimately loading onto ribosomes for protein translation. mRNA-binding proteins (mRBPs) were shown to have hundreds to thousands of different mRNA targets, and individual mRNAs may harbor many sites that are occupied in a spatial and temporal order as they mature and decay inside cells. For a given mRBP to elicit its function, a sufficient fraction of the cellular copies of its target transcripts must be occupied. RNA-binding proteins have been implicated as post-transcriptional modulators of gene networks in a wide range of human diseases including cancer. To understand how dysregulation of post-transcriptional networks can lead to human diseases, the breadth of RBP targets and their specific binding sites need to be determined. Capturing and functionally interpreting RNA-mRBP interactions at the transcriptome-wide level is a daunting task and argues for combining focused biochemical and structural characterization with large-scale systems-biology approaches.
Covalent crosslinking of RNAs to proteins by exposure to short- or long-wavelength ultraviolet (UV) light, coupled with deep-sequencing of cDNA libraries derived from the isolated crosslinked RNAs, allows investigators to identify the RNA targets of RBPs at unprecedented scale and depth. The majority of methods for identifying the binding sites of RBPs rely on an enrichment protocol, either immunoprecipitation of the RBP of interest[1–3] or affinity purification of an ectopically expressed tagged form of the RBP, e.g. by a His-tag[4,5]. The methods are also distinguished by their strategies for separating signal from the background RNA inevitably present in these approaches. Crosslinking and immunoprecipitation (CLIP) methods coupled with deep-sequencing were shown to be suitable for transcriptome-wide mapping of the binding sites of RBPs or microRNPs in living cells and uncovered post-transcriptional regulatory networks in cells and tissues. CLIP methods yield sequencing reads corresponding to RNA segments at or near the binding sites. Binding sites typically contain a defined, often short RNA sequence and/or a structural element, known as the RNA recognition element (RRE), which is specifically bound by amino acid residues located in the RNA-binding domain (RBD) of an RBP. Isolation and characterization of crosslinked mRNA fragments by sequencing represents an advantage over earlier methods such as RIP-chip and RIP-seq, which determine the enrichment of full-length target mRNAs in the IP versus the lysate but do not directly isolate the binding site. In these RIP-based methods, the absence of covalent crosslinks also requires less stringent, non-denaturing RNP isolation protocols, which yield more background. Nevertheless, RIP-chip and RIP-seq without crosslinking are valuable methods, as they can provide a measure of the affinity of an RBP towards its mRNA targets.
Combining RIP-chip with CLIP allows for a detailed analysis that relates binding site information to overall mRNA-binding affinity, offering a starting point for the characterization of interesting post-transcriptionally regulated mRNAs. Such a combined approach has been successfully applied to identify targets and begin to address the mechanism of target selectivity for a number of RBPs[9,10], including HuR, a protein implicated in numerous cancers. Elevated expression of HuR has been detected in various cancers with poor prognosis. The combination of RIP-chip and CLIP identified a transcriptome-wide list of targets and indicated that HuR functions at several levels of post-transcriptional regulation, including pre-mRNA processing and mature mRNA stability. Further, an analysis of the proximity of HuR and miRNA binding sites indicated that HuR binding sites located within 10 nt of a miRNA target region prevented miRNA-dependent repression.
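The proximity analysis described above can be illustrated with a minimal sketch. It assumes that binding sites are given as half-open (start, end) intervals in transcript coordinates on the same mRNA; the function names and the example coordinates are hypothetical and are not taken from the cited studies.

```python
def gap(a, b):
    """Distance in nt between two half-open intervals (start, end).

    Returns 0 when the intervals overlap or are directly adjacent.
    """
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def sites_near(rbp_sites, mirna_sites, max_gap=10):
    """Return the RBP sites lying within max_gap nt of any miRNA target site."""
    return [s for s in rbp_sites
            if any(gap(s, m) <= max_gap for m in mirna_sites)]


# Hypothetical coordinates on a single transcript (illustration only).
hur_sites = [(100, 107), (250, 257), (400, 407)]
mirna_sites = [(112, 134), (300, 322)]

near = sites_near(hur_sites, mirna_sites)
print(near)
```

Only the first HuR site in this toy example lies within 10 nt of a miRNA target region; whether such a site actually relieves repression would still have to be tested experimentally.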
The identification of specific RNA-interacting proteins from cell lysates is accomplished by mass spectrometry. Early approaches used conventional chromatography[12,13], or affinity purification using antibodies raised against known mRBPs, to identify proteins present in mRNPs. Alternatively, RNA conjugates can be incubated with lysates for in vitro RNP assembly, then affinity-purified to identify interacting proteins. To distinguish proteins that directly contact RNA from proteins that are only present in assembled RNPs due to protein-protein interactions, RNA-protein covalent bonds are formed by UV-crosslinking, which enables the harsher purification steps required to remove non-covalently bound proteins. Highly purified crosslinked RNP products were also used to precisely map the amino acids directly contacting RNA[17,18]. Crosslinked RNPs are typically treated with RNases upon cell lysis to obtain smaller RNA fragments (20–50 nt), which are radiolabeled after immunopurification of the specific RNPs and resolved by SDS-PAGE for their detection and isolation.
The identification of RBPs specifically binding to a given target mRNA segment represents a challenging biochemical problem because, in cell lysates, all RBPs simultaneously compete for the target RNA in the absence of an otherwise temporally and spatially organized process. As an example, an intronic RNA sequence, which is normally retained in the nucleus, may be bound by cytoplasmic RBPs in lysates, and vice versa. Furthermore, abundant RBPs with low sequence specificity or short RREs effectively compete with less abundant but more specific RBPs and yield high background binding. To address this background problem, SILAC mass spectrometry approaches were recently developed[19,20]. In these assays, RNA bait and reference segments of about 100–200 nt are incubated in distinct isotope-labeled cell lysates, followed by determination of the isotope enrichment of proteins bound to the RNA bait versus the reference. These approaches allow for medium-to-high throughput of hundreds of RNA segments, which can be useful to scan the sequence of a selected mRNA and find both its potential interactors and regulatory regions.
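The bait-versus-reference comparison can be sketched as a simple log2 ratio of isotope-channel intensities. This is an illustrative calculation only: the protein names, the intensities, the pseudocount and the two-fold threshold are assumptions, not values or parameters from the cited SILAC studies.

```python
import math


def log2_enrichment(bait, reference, pseudocount=1.0):
    """log2 ratio of bait-channel to reference-channel intensity.

    The pseudocount (an assumption) avoids division by zero for proteins
    that are not detected in one of the two channels.
    """
    return math.log2((bait + pseudocount) / (reference + pseudocount))


# Hypothetical intensities: (bound to RNA bait, bound to reference RNA).
intensities = {
    "protein_A": (799.0, 99.0),
    "protein_B": (120.0, 110.0),
}

scores = {name: log2_enrichment(b, r) for name, (b, r) in intensities.items()}

# Assumed cut-off: at least two-fold enrichment on the bait (log2 >= 1).
enriched = [name for name, score in scores.items() if score >= 1.0]
print(scores, enriched)
```

In this toy example only protein_A passes the cut-off, whereas protein_B binds bait and reference almost equally and would be treated as background.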
In contrast to RNP assembly using cell lysates, the isolation of in vivo assembled RNPs avoids potential biochemical artifacts. This possibility prompted the development of RNA tags that are joined to, and therefore co-expressed with, the RNA of interest[21–24]. The most common approach introduces one or more copies of an RNA-encoded aptamer, which is recognized by a protein (e.g. MS2 coat protein or streptavidin) or small molecule (e.g. streptomycin) that is immobilized on a chromatography matrix. More recently, SILAC was coupled to RNA-based aptamer approaches as a means of identifying and quantitating affinity-purified, in vivo assembled RNPs. Furthermore, significant advances in the visualization of mRNAs in living cells can be attributed to the use of in vivo RNA tags[26,27].
In an effort to catalog and discover new mRBPs, recent mass spectrometric approaches were applied to determine the protein composition of UV-crosslinked polyadenylated (polyA) cellular RNAs (Figure 2A). Crosslinked mRNPs were isolated under protein-denaturing conditions from cultured human cells using oligoT beads[28,29]. The Hentze group identified 860 crosslinked proteins in HeLa cells, while the Landthaler group identified 797 in HEK293 cells; 554 proteins were jointly identified, corresponding to 64% and 70% of the respective sets. A comparison to a table of 1,783 human genes encoding known RBPs or proteins annotated to contain RBDs showed that the majority of the proteins identified by the Hentze and Landthaler groups, 618 (72%) and 624 (78%), respectively, were previously known. Within this category, 480 proteins were jointly identified, corresponding to a 78% and 77% overlap, respectively. The abundance and specificity of expression for overlapping and non-overlapping RBPs were comparable between both groups, indicating that differences in the experimental methods were responsible for the discrepancy. Among the proteins newly identified as candidate RBPs, 74 were shared, corresponding to a 31% and 43% overlap for the Hentze and Landthaler datasets, respectively. Given that the abundance and specificity of expression of the novel candidates were comparable to those of the canonical RBPs, the approximately 2-fold lower overlap among candidate RBPs compared to known RBPs may suggest an increased false-positive rate or differences in the bioinformatic cut-offs used to define novel RBPs. Both laboratories performed further validation studies for a fraction (<5%) of their candidate RBPs, thereby providing additional RNA-binding evidence (Table 1).
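The overlap percentages in this comparison follow directly from the protein counts, and the arithmetic can be checked with a short sketch. The counts are those quoted in the text; the variable names are ours, and the numbers of novel candidates per study are derived by subtracting previously known proteins from the totals.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


# Counts quoted in the text.
hentze_total, landthaler_total, shared_total = 860, 797, 554
hentze_known, landthaler_known, shared_known = 618, 624, 480
shared_novel = 74

# Derived: novel candidates are the proteins not previously known as RBPs.
hentze_novel = hentze_total - hentze_known
landthaler_novel = landthaler_total - landthaler_known

summary = {
    "all, shared": (pct(shared_total, hentze_total),
                    pct(shared_total, landthaler_total)),
    "known, fraction of each set": (pct(hentze_known, hentze_total),
                                    pct(landthaler_known, landthaler_total)),
    "known, shared": (pct(shared_known, hentze_known),
                      pct(shared_known, landthaler_known)),
    "novel, shared": (pct(shared_novel, hentze_novel),
                      pct(shared_novel, landthaler_novel)),
}

for label, (hentze, landthaler) in summary.items():
    print(f"{label}: Hentze {hentze}%, Landthaler {landthaler}%")
```

The shared known and shared novel proteins (480 and 74) sum to the 554 jointly identified proteins, which provides a useful consistency check on the counts.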
Some of these novel RBPs belong to large, ubiquitously expressed protein families whose other members were not captured by these methods. This is difficult to rationalize given the expectation that specific protein domains, presumably conserved among family members, are required for binding RNA. A domain analysis of the full set of candidate RBPs from both studies showed an accumulation of domains that bind nucleotides (ATP, GTP) and nucleotide analogs (e.g. NAD), which may be cautiously interpreted to indicate that these domains also interact with longer RNAs, possibly at their termini, or might imply that their activity is modulated by interaction with mRNAs.
Some of the novel proteins contain repetitive, low-complexity amino acid motifs, such as arginine-glycine (RG)-rich repeats, which have previously been shown to participate in RNA binding. It is plausible that such repeat motifs could confer RNA binding, although this model does not fully explain why other family members of the newly identified RBPs were not found. Another explanation may be that these proteins evolved further to contain unique RNA-binding surfaces within largely pre-existing domain structures or, vice versa, that RNAs evolved their sequence to modulate the activity of these specific proteins. This would imply that some RREs could be as complex as RNA aptamers in order to bind specifically to a given protein. For such complex and unique interactions to be identified by mass spectrometry, the target mRNA would have to be present at very high copy numbers.
The molecular understanding of RNA-protein interactions is derived from the elucidation of solution or crystal structures of RBPs bound to a cognate target. It is important that such studies be put into perspective with large-scale data sets, which can either agree with the solved isolated structure or prompt the need for additional study. The PUM and NOVA protein families are known examples in which the molecular contacts observed in RNA co-crystal structures correspond to the RREs deduced from large-scale studies of natural target RNAs. Extensive characterization of PUM2 targets by microarray and CLIP studies largely confirmed earlier biochemical assays and structural results. PUM proteins appear unique among RBPs, as each RBD repeat contacts a single nucleotide base of an RNA molecule containing the consensus sequence UGUANAUA. PUM proteins bind their targets highly selectively and are among the most sequence-specific RBPs. The NOVA proteins contain three KH domains, and their RREs are composed of YCAY (Y = U or C). X-ray crystallography was first performed using the third KH (KH3) domain and an RRE-comprising RNA target[36,37], but a recent crystal structure of the KH1 and KH2 domains showed that these domains also recognize YCAY elements. These findings were consistent with UV254 CLIP experiments, and further indicated that RNA targets with repeated YCAY elements can be bound by a single RBP.
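Consensus RREs of this kind can be located in a transcript with a simple pattern search. The sketch below assumes the standard IUPAC ambiguity codes (N = any base, Y = C or U) and uses a made-up RNA sequence; it illustrates motif scanning only and is not the analysis pipeline of the cited studies.

```python
import re

# IUPAC nucleotide codes used by the motifs in the text, as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "N": "[ACGU]", "Y": "[CU]", "H": "[ACU]", "M": "[AC]",
}


def motif_positions(motif, rna):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead makes the regex report matches that overlap each other.
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return [m.start() for m in re.finditer(pattern, rna)]


# Hypothetical RNA fragment (illustration only).
rna = "GGUGUAAAUACCUCAUCACGG"

pum_sites = motif_positions("UGUANAUA", rna)
nova_sites = motif_positions("YCAY", rna)
print(pum_sites, nova_sites)
```

In this fragment the PUM consensus occurs once, whereas two overlapping YCAY elements are found, in line with the observation that NOVA targets often contain repeated YCAY elements.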
For many RBPs, structural results have yet to be reconciled with transcriptome-wide studies. For example, the KH1 and KH2 domains of FMRP, which adopt a typical KH domain structure and are therefore expected to bind a 4-nt RRE typical of other KH domain proteins, were initially shown to bind to a kissing complex RNA, which has not been demonstrated to occur within natural targets. No structure has yet been solved for FMRP with a KH-specific target RNA ligand.
The RRE for members of the insulin-like growth factor 2 mRNA-binding protein (IGF2BP) family has been identified as CAUH (H = A, C, or U) by PAR-CLIP. RIP-chip experiments, although they lack the same site-specific resolution, found an enrichment of multiple CA-dinucleotide-containing motifs. Another group obtained, from SELEX experiments, an RNA ligand that binds to the KH3 and KH4 domains of IGF2BP1 and contains MCAY (M = A or C) as well as an additional G-rich element. They were able to obtain NMR solution structures of two complexes, in which the order of the MCAY and G-rich elements could be interchanged. In all studies, a CA dinucleotide appears prominent among the targets identified for IGF2BP proteins, although the breadth of their functionally relevant targets remains unclear. Given that IGF2BP proteins are known to regulate gene expression in proliferation and differentiation pathways, it is not surprising that two members (IGF2BP1 and IGF2BP3) have been found to be highly expressed in some aggressive cancers. Thus a clearer understanding of the IGF2BP mRNA targets should be helpful in defining the role of these RBPs in human disease.
The RREs of many mRBPs comprise only short sequences that can frequently occur by chance in longer RNAs, which in turn is consistent with the many target sites identified in transcriptome-wide studies. Even though these RREs are short, they are nevertheless recognized sequence-specifically, as many biochemical follow-up studies have shown. Some RBPs, however, depend on specific cellular mechanisms that direct their association to only a subset of RRE-containing cellular targets. Gemin2 is one such protein factor: it directs the specific assembly of Sm protein rings on snRNAs at their RRE, AUUUUUA, which is also present in many other RNAs. Gemin2 binds the pentameric, pre-assembled Sm protein core on snRNAs in an snRNA-dependent, cooperative manner. hnRNP and SR proteins are also known to bind short degenerate sequences cooperatively to achieve specificity.
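How often a short RRE is expected to occur by chance can be estimated with a back-of-the-envelope calculation. The sketch below assumes a simplified null model in which the four nucleotides are equally frequent and independent, which real transcripts do not strictly obey; the 3,000-nt transcript length is an arbitrary illustrative choice.

```python
# Number of nucleotides allowed at a position for each IUPAC code.
DEGENERACY = {"A": 1, "C": 1, "G": 1, "U": 1, "Y": 2, "M": 2, "H": 3, "N": 4}


def match_probability(motif):
    """Probability that a random position matches the motif.

    Assumes equal, independent base frequencies (a simplifying assumption).
    """
    p = 1.0
    for code in motif:
        p *= DEGENERACY[code] / 4
    return p


def expected_hits(motif, length_nt):
    """Expected number of chance matches in a random RNA of length_nt."""
    windows = max(0, length_nt - len(motif) + 1)
    return windows * match_probability(motif)


for motif in ("YCAY", "CAUH", "AUUUUUA", "UGUANAUA"):
    print(motif, round(expected_hits(motif, 3000), 2))
```

Under these assumptions a 4-nt degenerate element such as YCAY is expected dozens of times in a 3,000-nt transcript, whereas the longer PUM consensus is expected less than once, which illustrates why short RREs alone cannot account for target selectivity.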
Additional complexity is added by the many RBPs that contain combinations of RNA-binding domains in modular organization. The contribution of individual RBDs, and their relation to each other in target recognition remains unknown. Multidisciplinary approaches combining broad-scale studies and conventional biochemistry in vivo and in vitro are therefore essential for the elucidation of binding specificity of multi-RBD RBPs, and to determine whether basic rules of RNA-protein interactions and dominance of specific RBDs can be established.
Following the discovery of specific binding sites on mRNAs for a given RBP, functional assays are required to assess the impact of binding on post-transcriptional regulation. These assays traditionally capture changes in mRNA stability or alternative splicing. More recently, a new assay termed ribosome profiling was introduced to address the role of RBPs in mRNA translation. It improves on the ribosome footprinting approach by using deep-sequencing to recover mRNA fragments from translating ribosomes, from which rates of protein synthesis are inferred[49,50] (Figure 2B). Nuclease-digested ribosome-associated mRNAs are biochemically fractionated by density centrifugation and oligonucleotide size selection, converted to a cDNA library and deep-sequenced. Total RNA, as a normalization control, is similarly processed for sequencing, and the ratio between the density of recovered ribosome-protected mRNA fragments and the density of sequence reads of the same mRNAs from the control is computed. Prior to cell lysis, different drugs can be added to stall translating ribosomes at initiation or elongation, enabling precise mapping of (alternative) translation initiation sites and measurement of translation kinetics.
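The normalization described above can be written as a ratio of read densities. The sketch below assumes a simple reads-per-kilobase-per-million density for both libraries; the function names, gene labels and read counts are hypothetical, and published pipelines may differ in how they count and normalize reads.

```python
def density(read_count, length_nt, library_size):
    """Reads per kilobase of transcript per million reads in the library."""
    return read_count / (length_nt / 1e3) / (library_size / 1e6)


def translation_efficiency(rpf_reads, rna_reads, length_nt, rpf_library, rna_library):
    """Ratio of ribosome-protected-fragment density to total-RNA read density."""
    rpf_density = density(rpf_reads, length_nt, rpf_library)
    rna_density = density(rna_reads, length_nt, rna_library)
    return rpf_density / rna_density


# Hypothetical counts for two genes (illustration only).
te_gene_a = translation_efficiency(
    rpf_reads=200, rna_reads=100, length_nt=1000,
    rpf_library=1_000_000, rna_library=1_000_000)
te_gene_b = translation_efficiency(
    rpf_reads=50, rna_reads=100, length_nt=1000,
    rpf_library=1_000_000, rna_library=2_000_000)
print(te_gene_a, te_gene_b)
```

Comparing this ratio for the same mRNA between conditions, for example with and without depletion of an RBP, is one way to ask whether binding affects translation rather than mRNA abundance.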
Ribosome profiling gives a measurement of actively translated RNAs and provides a sensitive tool to assess changes to the nascent proteome in different cell types or conditions. The approach also identified a class of short, highly translated ORFs, termed short polycistronic ribosome-associated RNAs (sprcRNAs), within previously annotated regions of long intergenic non-coding RNAs (lincRNAs). Since the metric used in ribosome profiling studies normalizes ribosome-protected fragment sequences over RNAseq data across transcripts, it is not well understood whether 5′-UTR scanning by the ribosome is discernible from true sites of active translation. Moreover, the functions of many lincRNAs are not well characterized, and protein products of sprcRNAs have not been isolated.
Given that alternative 3′ UTRs can dictate the repertoire of RBPs that associate with and regulate mRNAs, the detailed annotation of the 3′ ends of mRNAs represents an area of active research. Alternative cleavage and polyadenylation of the 3′ UTRs of mRNAs is closely tied to cellular function and growth, and various methods were developed to capture and sequence the 3′ UTRs of mRNAs, typically beginning with an oligoT-based approach[51–54]. Shortening of 3′ UTRs and usage of alternative polyadenylation sites have been implicated as mechanisms to evade post-transcriptional gene regulation in cancer. Understanding the regulation of alternative 3′ UTRs of mRNAs in higher eukaryotes should provide additional and important determinants for assessing the potential regulatory impact of RBPs on their targets in human disease.
The path of gene expression, from transcription to translation, is a multi-step and ordered series of regulated events for which branched pathways are only now being uncovered. Unlike transcriptional responses, post-transcriptional responses rarely exceed 2- to 4-fold changes, at least within the resolution of current assays. Nevertheless, many genetic diseases have been identified in which mutations of RBPs or related processing factors, most of which are ubiquitously expressed, result in a large spectrum of diseases. Prominent among these are neurological and neuromuscular disorders, metabolic and mitochondrial diseases, and human cancers. One of the more challenging problems with many of these diseases and the RBPs implicated is the striking tissue specificity of their effects, even though the RBPs of interest are expressed far more broadly than the tissues affected in a given disease. RBPs such as FMR1 have pleiotropic, syndromic clinical manifestations that extend beyond their well-documented importance to human cognition and memory. With respect to cancer, developmentally regulated RBPs such as the LIN28 family members (LIN28A and LIN28B) or MSI1 (Musashi) are frequently found overexpressed in certain types of tumors. LIN28 is normally expressed in embryonic cell types, and the upregulation of LIN28A or LIN28B correlates with advanced tumors and poor prognosis. MSI1 is normally expressed in multipotent stem cells in the brain, breast, and hair follicles but is found overexpressed in medulloblastomas, chronic myelogenous leukemia, and acute myeloid leukemia[56–58]. Recent work with Musashi indicates that it can regulate critical genes involved in cancer-related pathways. It is therefore plausible that many RBPs affect the same gene networks across different cellular contexts and that such gene pathways, when dysregulated, manifest as different, seemingly unrelated phenotypes (or are compensated in a tissue-specific fashion).
Although basic modes of post-transcriptional gene regulation can be studied in any cellular system, their phenotypic impact is variable and depends on the transcriptomic context. Accordingly, the complexities of varying transcriptomes and post-transcriptional regulatory networks must be investigated with an eye towards network analysis and systems-oriented, multidisciplinary strategies, in conjunction with focused validation studies.