RNA binding proteins (RBPs) are involved in several post-transcriptional stages of gene expression and dictate the quality and quantity of the cellular proteome. When aberrantly expressed, they can lead to disease states, including cancer. A basic requirement for understanding their role in normal tissue development and cancer is the construction of comprehensive gene expression maps. To this end, we generated a list of 383 human RBPs based on the NCBI and Ensembl databases. SAGE and MPSS were then used to verify their levels of expression in normal tissues, while SAGE and microarray datasets were used to perform comparisons between normal and tumor tissues. As the main outcomes of our studies, we identified clusters of co-expressed or co-regulated genes that could act together in the development and maintenance of specific tissues; we also obtained a high-confidence list of RBPs aberrantly expressed in several tumor types. This latter list contains potential candidates to be explored as diagnostic and prognostic markers as well as putative targets for cancer therapy approaches.
A decade of studies using high throughput genomics and proteomics data revealed that cancer cells differ from normal cells in both RNA and protein contents. Presumably, a substantial part of this difference originates from gene expression regulators. Mutations affecting their function as well as events that alter their expression levels can lead to a cascade of effects, compromising the expression of several direct and indirect target genes. Not surprisingly, a substantial number of tumor suppressor genes and oncogenes are in fact transcription factors (reviewed in ref. 1). Using the same rationale, post-transcriptional regulators are expected to be listed in the same categories. In fact, a good portion of aberrant protein expression in tumors has its origin at the post-transcriptional level. For instance, studies with lung adenocarcinomas have shown that there is only a 21% correlation between the transcriptome and the proteome in these cells.2
The two major types of regulators of post-transcriptional events are RNA binding proteins (RBPs) and non-coding RNAs, especially microRNAs (reviewed in ref. 3). With regard to RBPs, overexpression of several of them (YB1, hnRNPA1, PABP2, La, hnRNPE2, etc.) has been observed in different primary tumors and cancer cell lines.4–6 However, their role in tumor formation and progression is still poorly understood; careful analyses addressing the subject are rare. A good example is provided by a study carried out by Jeff Ross’ group on CRD-BP, an oncofetal protein that is known to be present in colon cancer (81%), breast cancer (58.5%) and sarcoma (73%).7 In this study, CRD-BP was expressed in mammary epithelial cells of adult transgenic mice. The incidence of mammary tumors was 95% and some of the tumors metastasized. The authors concluded that CRD-BP in fact functions as an oncogene.8 Four other recent studies demonstrated that RBM3 (ref. 9), PTB (ref. 10), Musashi1 (refs. 11 and 12) and ASF/SF2 (ref. 13), whose aberrant expression is seen in different tumor types, also have oncogenic properties. On the other hand, it has been shown that RBM5, also known as Luca-15, potentially functions as a tumor suppressor by increasing apoptotic signals and inhibiting cell proliferation.14,15
RBPs contribute to the quality and quantity of the proteome of a given cell by modulating cellular processes like splicing, translation, RNA stability, RNA transport and localization. Aside from the splicing process, RBPs function mainly by binding to cis-regulatory elements located on the untranslated regions (UTRs) of target mRNAs. In this regard, it is important to acknowledge the connection between UTR-mediated regulation and cancer. Approximately 10% of all mRNAs have atypically long 5′ UTRs, in most cases containing a variety of regulatory elements. 75% of them encode oncogenes and genes implicated in cell growth, death and proliferation (reviewed in refs. 16 and 17).
Mapping the expression of RBPs in normal and cancer tissues constitutes an important step towards a full understanding of their participation in normal tissue development and tumorigenesis. In this direction, we generated a comprehensive list of RBPs and performed a systematic analysis using SAGE (Serial Analysis of Gene Expression), MPSS (Massively Parallel Signature Sequencing) and microarray data. We provide here a detailed expression map of a representative group of RBPs in 33 different adult tissues. Moreover, an extensive comparison among different collections of normal and tumor tissues allowed us to generate a broad list of RBPs that are aberrantly expressed in several tumor types.
Although various databases provide lists of proteins/genes organized by functional domains, it is becoming evident that these lists are either incomplete or present redundant information. In order to circumvent these problems and create a comprehensive list of human RBPs, we searched both the NCBI and EBI databases for proteins containing the most characteristic domains that interact with RNA (RRM or RBD, dsRBD and KH) as well as for proteins whose description included the key word “RNA binding”. The files derived from each individual database were compared and compiled. Redundant proteins and alternative splicing variants were consolidated. Finally, a total of 932 protein IDs were grouped into 383 RBP genes (clusters) (Suppl. File 1). Even though our collection is one of the most comprehensive sets of RBPs available, it is far from complete. As proteins become better characterized, more examples are identified of proteins that lack a classic RNA binding domain but nevertheless interact with RNA.
A Gene Ontology analysis using the cellular component parameter indicated that out of 383 proteins, 205 are supposed to have strict nuclear localization or nuclear/cytoplasmic localization (Suppl. File 2). Details about the list preparation are described in the methods section.
SAGE and MPSS are methodologies that have been used extensively to map gene expression in numerous tissues and to identify proteins implicated in tumorigenesis.18–22 The main advantage of these technologies is the possibility of performing multiple comparisons involving large sets of genes in a variety of tissue types. We combined these methods to map the expression of our list of RBPs in normal tissues.
In order to query the SAGE and MPSS libraries, each RBP transcript must first be linked to a reliable tag sequence (tag to gene assignment). Each tag (a short nucleotide sequence) corresponding to a transcript is used to determine the relative frequency of a given gene in a particular tissue or cell line. After listing all the tags for the RBPs present in our list, we employed an additional step to eliminate tags that present a problematic sequence and/or an ambiguous tag to gene assignment (see Methods for details). The analysis of some genes can be compromised by problems with tag sequences23 or with tag to gene assignment.24 There are cases, for instance, of genes that do not have a unique SAGE or MPSS tag. If a tag is shared by two different genes, the final tag count will actually reflect the combined expression of these two genes. The best option in these cases is to eliminate the problematic tags/genes from the analysis. The screened tags cover a total of 363 RBPs; 305 RBPs have reliable tags for both SAGE and MPSS.
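The elimination of ambiguous tags can be illustrated with a minimal sketch. It assumes the tag to gene assignments are available as simple (tag, gene) pairs; the tag sequences, the pairings and the function name `reliable_tags` are hypothetical and serve only as an illustration, not as the procedure actually used to build Supplementary files 3 and 4.

```python
from collections import defaultdict


def reliable_tags(tag_gene_pairs):
    """Return {tag: gene} for tags that map to exactly one gene.

    Tags assigned to two or more genes are dropped, because their
    counts would mix the expression of several genes.
    """
    genes_by_tag = defaultdict(set)
    for tag, gene in tag_gene_pairs:
        genes_by_tag[tag].add(gene)
    return {
        tag: next(iter(genes))
        for tag, genes in genes_by_tag.items()
        if len(genes) == 1
    }


# Hypothetical tag sequences and assignments, for illustration only.
pairs = [
    ("AAACCCGGGT", "MSI1"),
    ("TTTGGGCCCA", "NONO"),
    ("TTTGGGCCCA", "PTBP1"),  # shared tag: ambiguous, discarded
    ("GGGTTTAAAC", "NPM1"),
]
kept = reliable_tags(pairs)
```

In this toy input the shared tag is removed together with both of its gene assignments, so only genes with an unambiguous tag remain available for counting.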
The reliable “tags” were used to determine the expression pattern of their corresponding RBPs in 31 normal adult tissues and in embryonic stem cells and placenta. We used MPSS libraries only in cases where a given tissue type was not covered by the SAGE libraries. The libraries and tags used in our analysis are listed in Supplementary files 3 and 4. Unfortunately, SAGE and MPSS data cannot be compared directly due to technical differences in the preparation of the libraries. This means that although vertical comparisons are possible (comparisons to assess differences in expression of two or more RBPs within the same tissue), horizontal comparisons (comparisons between different tissues to determine differences in expression of the same RBP) are only possible if they are done either between two or more SAGE libraries or between two or more MPSS libraries.
Figures 1 and 2 summarize the data we obtained for the 305 proteins that have both SAGE and MPSS tags. We did not take into consideration relative expression levels; proteins were listed as long as the numbers indicated their presence in a given tissue. Figure 1 shows the number of RBPs that were found to be expressed in each individual tissue. In Figure 2, the graph (number of RBPs vs. number of tissues) shows the distribution of the RBPs according to the number of tissues where they were found to be expressed. The raw data for Figures 1 and 2 are presented in Supplementary files 5 and 6 and the complete SAGE and MPSS analyses in Supplementary files 7 and 8. As can be observed, the great majority of the RBPs can be considered to be ubiquitously expressed; 227 RBPs are expressed in 10 tissues or more, while proteins that have their expression restricted to only one tissue constitute ~2.5% of the RBPs analyzed. These “tissue specific” proteins are listed in Supplementary file 9.
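The two summaries (RBPs per tissue and tissues per RBP) amount to a presence/absence count. The sketch below is an illustration under stated assumptions: a tag count greater than zero is taken as "present" (the exact presence criterion is not given here), and the RBP names, tissues and counts are invented.

```python
def summarize_presence(expression):
    """Summarize presence/absence of RBPs across tissues.

    expression: {rbp: {tissue: tag_count}}
    Returns (rbps_per_tissue, tissues_per_rbp).
    Assumption: a tag count above zero means the RBP is present.
    """
    rbps_per_tissue = {}
    tissues_per_rbp = {}
    for rbp, counts in expression.items():
        present = [tissue for tissue, count in counts.items() if count > 0]
        tissues_per_rbp[rbp] = len(present)
        for tissue in present:
            rbps_per_tissue[tissue] = rbps_per_tissue.get(tissue, 0) + 1
    return rbps_per_tissue, tissues_per_rbp


# Invented counts, for illustration only.
expression = {
    "RBP_A": {"brain": 12, "liver": 3, "lung": 7},
    "RBP_B": {"brain": 5, "liver": 0, "lung": 0},
    "RBP_C": {"brain": 0, "liver": 9, "lung": 1},
}
per_tissue, per_rbp = summarize_presence(expression)

# RBPs detected in exactly one tissue ("tissue specific" in the text).
tissue_specific = sorted(r for r, n in per_rbp.items() if n == 1)
```

Because SAGE and MPSS counts are not directly comparable, a presence/absence summary of this kind is one of the few cross-platform views that does not depend on relative expression levels.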
Finally, we performed a hierarchical clustering analysis to identify RBPs with similar patterns of expression in normal tissue (Supplementary file 10). These RBP clusters could be used to identify functional protein groups that co-regulate gene expression in a tissue-specific fashion. The results corroborate this idea by showing that the distribution of the clusters follows the embryonic origin of the tissues. In this case, the RBP clusters could be categorized as ectodermic (cerebellum, astrocyte, cortex, stomach and colon), endodermic (thyroid, prostate, breast, lung and liver) and mesodermic (vascular endothelium and white blood cells).
As discussed in the introduction, since RBPs contribute substantially to the quality and quantity of the protein content of cells, it is reasonable to think that RBPs could participate in tumor formation and progression. However, the number of RBPs described so far as either tumor suppressors or oncogenes is extremely small. The lack of studies specifically designed to understand the involvement of RBPs in cancer is responsible, at least in part, for this scenario. Although aberrant expression in tumor samples cannot be considered definitive evidence of participation in tumorigenesis, it has proved to be a strong indicator. In fact, all the RBPs named in the introduction as potential oncogenes or tumor suppressors were first identified in studies showing that they were aberrantly expressed in a high number of tumor samples. Therefore, studies targeting the identification of RBPs that are either up- or downregulated in tumor samples can direct cancer biologists to genes with a high chance of being directly involved in malignancy. To this end, we performed a study combining SAGE and microarray datasets. To increase confidence in the results, we performed the analysis in two steps. First, we used SAGE libraries to compare normal and tumor samples from 11 tissues: astrocyte (brain), cerebellum, cortex (brain), breast, colon, liver, lung, prostate, stomach, thyroid and white blood cells. A list was prepared of RBPs that were found to be up- or downregulated (three-fold minimum cutoff) in tumor samples. This list was then rescreened using a different approach: we searched ONCOMINE (http://www.oncomine.org), a web-based resource that stores the results of microarray studies. For most tumor types, SAGE and Oncomine use the same type of classification; in these cases, a direct comparison was made.
Whenever the SAGE and Oncomine data agreed, the RBP was passed to a final list. RBPs with contradictory microarray data or no listed data were eliminated. Due to differences in categorization, the RBPs that were identified by SAGE as aberrantly expressed in tumors derived from neuronal tissues (astrocyte, medulla and cortex) had to be analyzed in a different way. We first compiled the RBPs that were determined to be aberrantly expressed in all three tumor types; they all correspond to upregulated proteins. We then retrieved data for these RBPs from microarray studies done with brain tumors. Finally, we passed to the final list only the RBPs determined to be upregulated in brain tumors in a minimum of three microarray studies. Table 1 contains the entire list of proteins for the combined SAGE-Oncomine analysis. Supplementary files 11 and 12 contain the detailed information regarding the SAGE analysis, while Supplementary file 13 contains the raw information for the Oncomine analysis. Since SAGE and Oncomine only deal with transcriptomic data, we also collected protein expression data to corroborate some of our findings. We conducted a shotgun proteomics analysis of two highly tumorigenic cell lines: Daoy (medulloblastoma)25 and U251 (glioblastoma). We used APEX26 to estimate absolute protein concentrations. Of the 36 RBPs upregulated in brain tumors, 18 were present in the Daoy and/or U251 data (Table 2). The 18 proteins were, on average, expressed at higher levels than other proteins detected in the respective datasets; in three datasets this difference was significant (t-test, p-value <0.05).
The chances that a given RBP is involved in tumor formation are expected to increase if its aberrant expression is observed in several distinct tumor types. In order to identify proteins of this type, we compiled all the data obtained with the SAGE vs. Oncomine comparison. The results are illustrated in Table 3 and discussed below. We then used the literature-based software Pathway Studio 6.0 (Ariadne Genomics, Inc.) to verify any associations between the proteins listed in Table 3 and cancer related processes (apoptosis, cell cycle, cell proliferation, cell differentiation, cell survival and cell growth). The results are presented in Table 4 and several of the pertinent references are listed in Supplementary file 14. We should highlight the RBPs that have a high number of connections to distinct cancer related processes: PTBP1, CSTF2, SSB, NONO, PNPT1, ADAR, TACC1, ACO1, APOBEC1, NPM1 and RPS5.
One important issue in cancer is the identification of changes in the proteome that lead cells to metastasize. Focal adhesions (FA) play a critical role during cell invasion. They constitute specialized attachment and signaling centers that form at sites of cellular contact. Hoog et al.27 used a mass spectrometry-based method named SILAC to identify and quantify proteins interacting in an attachment-dependent manner with focal adhesion proteins. This study revealed that a large portion of the identified proteins are RBPs; several of them are also present in our list of aberrantly expressed RBPs (Table 1): NONO, RPL7, RPL13, RPL22, RPL28, PTBP1 and U2AF.
We have successfully mapped the expression of more than 300 RBPs in both normal and tumor tissue. We expect this dataset to become a powerful tool to help with the design of approaches intended to comprehend the function of RBPs in development, tissue differentiation and tumorigenesis.
The mapping of RBP expression in normal tissue revealed that the majority of the RBPs we analyzed are ubiquitously expressed. However, clear differences in expression levels among tissues could be noted for numerous proteins. These data reflect, first, the well-known fact that several RBPs take part in basic molecular machineries such as the spliceosome, which are responsible for controlling essential aspects of RNA processing in all cell types. Second, our data corroborate the idea that differences in the concentration of specific RBPs constitute the main mechanism to achieve variations in gene expression among tissues. Although tissue-specific RBPs have important regulatory roles, our data suggest that they make only a minor contribution to the generation of tissue-specific profiles.
Analysis of RBP expression in the developing brain showed that many RBPs have a similar pattern of neuronal expression, which indicates that multiple RBPs function concurrently to regulate the expression of specific RNA subsets.28 In agreement with this observation, and corroborating the important role RBPs have in development and tissue differentiation, we determined that libraries from tissues of the same embryonic origin form clusters when organized according to RBP expression (Suppl. File 10).
An important part of our analysis relates to the identification of RBPs potentially involved in tumor formation. The fact that a protein is aberrantly expressed in several tumor types suggests a participation in cancer that deserves to be investigated. A combined analysis employing assays designed to test oncogenic or tumor suppressor properties, as well as high throughput methods to determine the direct RNA targets of selected RBPs, would be ideal to elucidate the impact of RBPs on tumor formation and progression (reviewed in ref. 3). The development of recent tools and methods such as the RIP-Chip/Ribonomics assay, CLIP and alternative splicing microarrays29–32 will accelerate the understanding of RBP participation in tumorigenesis.
On the list of RBPs aberrantly expressed in multiple tumor types, we observed that two protein families are particularly enriched. Ribosomal proteins constitute the first group. In agreement with the data we obtained, several recent reports have pointed to a possible participation of members of this family in tumor formation.33–36 It has been known for quite a while that ribosomes can vary in protein composition. Recent data from Pamela Silver’s laboratory showed that distinct ribosomal proteins can affect the expression of specific gene subsets in yeast.37 We therefore suggest that an increase in the expression of a specific group of ribosomal proteins could alter the cellular proteome both qualitatively and quantitatively, which in turn could alter the expression of genes directly linked to proliferation, apoptosis and other cancer related processes. Another possibility is that ribosomal proteins could contribute to tumor formation by acting in cellular processes other than translation. In fact, it was proposed several years ago that a group of ribosomal proteins may function as cell cycle checkpoints.38 The RBM family was the second group we identified as being enriched among proteins overexpressed in different tumor types. Most of its members are poorly characterized, with the exception of RBM5, also known as Luca-15,15 and RBM3.9 While RBM5 is apparently involved in apoptosis and has been suggested to function as a tumor suppressor, RBM3 was suggested to function as a proto-oncogene. It remains to be investigated whether other family members are also involved in tumor formation.
With respect to RBPs that were found to be upregulated in all three tumor types of the nervous system, NOL8 comes up as the first highlight. Among all identified proteins, it is the only one whose ratios (tumor vs. normal) exceed 10-fold in all three tissues in the SAGE analysis. Despite the fact that the function of this protein is poorly understood, there is one report connecting NOL8 and cancer. Knockdown of this protein in three different gastric cancer cell lines affected cell growth and increased apoptosis.39 The presence of MSI1 was somewhat expected, since high levels of expression were previously observed in glioblastoma, medulloblastoma and astrocytoma.40–44 We have recently shown that MSI1 is potentially involved in medulloblastoma formation through its role as a regulator of “cancer stem like cells”.11 A few proteins were identified as upregulated in other tissues as well as in brain tumors; this is the case for DKC1, EIF3S9, HNRPUL1, NONO, NPM1, NOL8, LA, PAIP1, RBM28, RO60, C20orf119, CPSF6 and RPS5. For most of them, we identified studies corroborating their upregulation in distinct tumor types; in the case of CPSF6, NPM1, NONO, NOL8 and RPS5, connections to cancer related processes have also been described. CPSF6 is a cleavage factor required for 3′ RNA cleavage and polyadenylation processing. CPSF6 was recently described as being part of the “Poised Gene Cassette”, a set of cancer specific genes exhibiting precise transcriptional control in solid tumors whose expression could influence metastasis.45 NPM1, or Nucleophosmin, is a relatively well studied gene in the context of cancer biology, being the most frequently mutated gene in acute myeloid leukemia (AML). NPM1 has been defined both as a putative proto-oncogene and as a tumor suppressor; it functions in several cellular processes that include ribosome biogenesis, regulation of chromosome duplication and cell proliferation (reviewed in ref. 46).
In a study aimed at identifying bladder cancer markers, NONO was strongly correlated with vascular invasion and associated with a decreased probability of survival.47 RNAi knockdown of NOL8 inhibited cell growth of HeLa cells48 and induced apoptosis in three diffuse-type gastric cancer cell lines, St-4, MKN45 and TMK-1.39 Alteration in the pattern of expression of RPS5 was observed during differentiation and apoptosis in murine erythroleukemia cells.
In conclusion, the gene expression map that we produced in a total of 33 normal tissues will be important for the identification of RBPs and clusters of RBPs that are potentially required for tissue specific development and maintenance. In regard to cancer, RBPs that have altered expression in a large number of tumor tissues appear as future candidates to be explored as diagnostic and prognostic markers and to be tested as target candidates in cancer therapy approaches.
The list of human RBPs was obtained from the EBI-InterPro (www.ebi.ac.uk/interpro/) and NCBI protein database (www.ncbi.nlm.nih.gov/sites/entrez?db=protein). In the EBI, we searched for proteins containing the most characteristic domains that interact with RNA (RRM or RBD, dsRBD and KH). In NCBI we also searched for proteins whose description includes the key word “RNA binding.” A manual inspection was performed to exclude non-RBP sequences.
We performed a clustering of all sets of sequences present in our list of RBPs; each cluster corresponds to one RBP gene. First, we mapped all RBP sequences into the genome using BLAT.50 Second, we employed a cluster algorithm to analyze the genome mapping and to merge those sequences containing the same exon-intron structure (reviewed in ref. 51). Third, based on sequence annotation, we attributed a gene name to each RBP cluster. Finally, we made a semi-automatic analysis of all clusters, checking the sequence groups and verifying gene annotation.
The cellular localization of RBPs was determined through the gene ontology information (GO).52 We made an association between the gene name and the GO cellular component (CC) term. All RBPs classified as “location in the nucleus” were selected.
SAGE libraries were downloaded from SAGEGenie.40 MPSS libraries were obtained from http://mpss.licr.org. All tags were normalized as described in ref. 21. Reliable tags for RBP genes were selected using the tag to gene information downloaded from ACTG.24 For most analyses presented in this paper, we selected only the 3′-most tag from mRNAs containing a poly(A) tail or poly(A) signal. Tags mapped to two or more genes were discarded.
Hierarchical clustering analysis is commonly used to identify patterns in a dataset. We used this method to analyze the expression profile of RBPs in the samples of normal tissues. The hierarchical clustering was performed with the Heatplus package of R (http://www.r-project.org/), using Euclidean distance as the measure of dissimilarity between elements.
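As an illustration of the technique, the following is a minimal pure-Python sketch of agglomerative clustering on Euclidean distances. It is not the Heatplus workflow used in the analysis: the linkage rule (average linkage here) is an assumption, since it is not stated above, and the tissue profiles are invented toy values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage (assumed rule).

    profiles: {name: [expression values]}
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between members of both clusters.
                total = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented expression profiles (three RBPs per tissue), for illustration.
profiles = {
    "cortex": [10, 2, 0],
    "cerebellum": [9, 3, 0],
    "liver": [1, 8, 7],
    "lung": [2, 7, 9],
}
merges = average_linkage(profiles)
```

With these toy values the two brain-derived profiles merge first and the liver and lung profiles merge second, which mirrors the kind of grouping by tissue origin described in the results.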
Comparison of gene expression between tumor and normal tissue was performed based on SAGE tag frequency. The identification of genes differentially expressed between normal and tumor samples was done in three steps: (i) we used a local implementation of the Monte Carlo simulation method described in ref. 53 to generate a list of differentially expressed genes (only differential expression supported by a p-value <0.05 was selected); (ii) from the list of differentially expressed genes, RBPs whose tags presented a cancer vs. normal ratio greater than three were classified as overexpressed in cancer; (iii) from the same list, RBPs whose tags presented a normal vs. cancer ratio greater than three were classified as underexpressed in cancer. Only tissues with both cancer and normal SAGE libraries were analyzed. All SAGE libraries used in this analysis are listed in the Supplementary Material.
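The three steps can be sketched as follows. This is a hedged illustration, not the local implementation mentioned above: the Monte Carlo test shown is a generic stand-in (conditional on the total tag count, each tag occurrence is assigned to the tumor library with probability proportional to library size), and the function names, library sizes and counts are hypothetical.

```python
import random


def monte_carlo_p(count_normal, count_tumor, size_normal, size_tumor,
                  n_sim=2000, seed=0):
    """Two-sided Monte Carlo p-value for equal tag frequency.

    Generic stand-in for the published method: under the null, each of
    the observed tag occurrences lands in the tumor library with
    probability size_tumor / (size_normal + size_tumor).
    """
    rng = random.Random(seed)
    total = count_normal + count_tumor
    p_tumor = size_tumor / (size_normal + size_tumor)
    expected = total * p_tumor
    observed_dev = abs(count_tumor - expected)
    hits = 0
    for _ in range(n_sim):
        sim_tumor = sum(rng.random() < p_tumor for _ in range(total))
        if abs(sim_tumor - expected) >= observed_dev:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


def classify(count_normal, count_tumor, size_normal, size_tumor,
             alpha=0.05, min_ratio=3.0):
    """Apply steps (i)-(iii): significance first, then the 3-fold ratio."""
    p = monte_carlo_p(count_normal, count_tumor, size_normal, size_tumor)
    if p >= alpha:
        return "unchanged"
    freq_normal = count_normal / size_normal
    freq_tumor = count_tumor / size_tumor
    # Multiplying instead of dividing avoids division by zero counts.
    if freq_tumor > min_ratio * freq_normal:
        return "overexpressed"
    if freq_normal > min_ratio * freq_tumor:
        return "underexpressed"
    return "unchanged"


# Hypothetical tag counts in two libraries of 50,000 tags each.
up = classify(2, 40, 50000, 50000)
down = classify(40, 2, 50000, 50000)
flat = classify(20, 22, 50000, 50000)
```

Comparing frequencies rather than raw counts matters when the normal and tumor libraries differ in size; with equal library sizes, as in this toy input, the two give the same ratio.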
All RBPs that showed differential expression in the SAGE normal vs. tumor analysis were re-screened. These proteins were checked against the Oncomine (http://www.oncomine.org) database of microarray studies using a p-value <=0.01. RBPs made the final list as long as the SAGE and Oncomine information matched. In cases of conflicting data (i.e., similar microarray studies showing both up- and downregulation) or lack of information, the RBP was discarded.
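The agreement rule can be written as a small filter. The sketch below assumes the SAGE result for each RBP is reduced to a single direction ("up" or "down") and the Oncomine result to the list of directions reported by significant microarray studies; the data layout, the placeholder RBP names and the function name `final_list` are assumptions made for illustration.

```python
def final_list(sage_calls, oncomine_calls):
    """Keep RBPs whose microarray evidence agrees with the SAGE call.

    sage_calls: {rbp: "up" or "down"}
    oncomine_calls: {rbp: [directions from significant studies]}
    An RBP is discarded when microarray data are missing, conflicting
    (both "up" and "down"), or opposite to the SAGE direction.
    """
    kept = {}
    for rbp, direction in sage_calls.items():
        directions = set(oncomine_calls.get(rbp, []))
        if directions == {direction}:
            kept[rbp] = direction
    return kept


# Placeholder names and calls, for illustration only.
sage_calls = {"RBP_A": "up", "RBP_B": "up", "RBP_C": "down", "RBP_D": "down"}
oncomine_calls = {
    "RBP_A": ["up", "up"],    # agrees with SAGE: kept
    "RBP_B": ["up", "down"],  # conflicting studies: discarded
    "RBP_C": ["up"],          # opposite direction: discarded
    # RBP_D has no microarray data: discarded
}
confirmed = final_list(sage_calls, oncomine_calls)
```

Requiring an exact match between the set of microarray directions and the SAGE direction handles the missing, conflicting and opposite cases with a single comparison.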
The data for the cytosolic fraction of the Daoy cell line were taken from ref. 25. In the case of the U251 cell line, we analyzed two biological replicates, each divided into cytosolic and pellet fractions. Experimental procedures are identical to those described in ref. 54. All raw data are deposited at http://www.marcottelab.org/MSdata/.
We would like to thank Carolina Livi, Yufei Huang, Julia P.C. da Cunha, Robson F. de Souza, Yufei Xiao, Suzanne Burns, Dat Vo and Tarea Burton for comments; Vishy Iyer and Patrick Killion for discussions on the experimental design; and Lauren Johnston and Marilyn Asher for helping with the literature search. P.A.F.G. was supported by a PhD fellowship from FAPESP. P.A.F.G. was supported in part by grant 5D43TW007015-02 from the Fogarty International Center, NIH. M.G. was supported by a fellowship of the Medical School (UTHSCSA). C.V. was supported by the International Human Frontier Science Program.