The RNA-Binding Protein DataBase (RBPDB) is a collection of experimental observations of RNA-binding sites, both in vitro and in vivo, manually curated from primary literature. To build RBPDB, we performed a literature search for experimental binding data for all RNA-binding proteins (RBPs) with known RNA-binding domains in four metazoan species (human, mouse, fly and worm). In total, RBPDB contains binding data on 272 RBPs, including 71 that have motifs in position weight matrix format, and 36 sets of sequences of in vivo-bound transcripts from immunoprecipitation experiments. The database is accessible through a web interface that allows browsing by domain or by organism, searching and export of records, and bulk data downloads. Users can also use RBPDB to scan sequences for RBP-binding sites. RBPDB is freely available, without registration, at http://rbpdb.ccbr.utoronto.ca/.
RNA-binding proteins (RBPs) have a fundamental role in a wide variety of cellular processes including transcription, RNA splicing and processing, localization, stability and translation (1–6). RBPs typically contain RNA-binding domains (RBDs) such as the RNA Recognition Motif (RRM) and the K homology (KH) domain, which are among the most numerous protein domains in metazoan genomes, including the human genome (7–9). Individual RBPs often have multiple RBDs that can independently bind RNA (10), and the approximately 400 annotated mammalian RBPs contain over 800 individual RBDs (11).
Knowledge of the RNA-binding activity of RBPs is critical for mapping and understanding transcriptional and post-transcriptional networks and regulatory mechanisms. Collections of DNA-binding specificities of transcription factors are available and widely used (12,13); however, to our knowledge, there is no central repository of information on the RNA-binding activities of RBPs. Here, we introduce the RNA-Binding Protein DataBase (RBPDB), a database of RNA-binding experiments. It includes a total of 1453 in vitro and in vivo experiments on 272 proteins, as well as 71 binding profiles in the form of position weight matrices (PWMs) and sequence logos, and 36 sets of sequences bound in vivo in immunoprecipitation experiments.
We anticipate that RBPDB will be of use to diverse researchers. In addition to searching for RNA-binding activities by protein, domain and experiment, RBPDB also allows users to scan RNA sequences for matches to RBP binding preferences stored in RBPDB. Additionally, the collected motifs should prove invaluable for genome-wide scans to identify cis-regulatory elements involved in post-transcriptional regulation via RBPs. Finally, the inclusion of in vivo bound transcripts provides a snapshot of enriched RBP-specific mRNA targets.
RBPDB is a collection of RBPs linked to a curated database of published observations of RNA binding. The database consists of a table of proteins, linked to other proteins through orthology relationships and to one or more experiments, if experiments are found. Each protein and experiment is assigned a unique internal ID number, and proteins are linked to Ensembl, FlyBase and WormBase gene annotations and RNA-bound protein structures on PDB (14–17). Experiments are associated with a PubMed ID. Motifs, PWMs and large-scale data sets are retained as flat files that are linked to experiment and protein IDs.
To populate the database, we first cataloged known and predicted RBPs in human, mouse, Drosophila and Caenorhabditis elegans (18–26). Most proteins were selected based on the presence of known sequence-specific RBDs (Table 1), which we compiled from review papers (3,4,7,8) and from searching and scanning Pfam domain annotations (27). We retrieved protein matches to InterPro domains from UniProt and Ensembl and used the union of these two sets. Additionally, we added proteins that bind RNA through a non-canonical RBD, such as a Sterile Alpha Motif (SAM) domain or C2H2 zinc finger, based on a Gene Ontology or keyword annotation as RNA-binding in Ensembl, UniProt or NCBI. However, we did not include domains that are largely specific to ribosomal proteins (e.g. S4 domain). Moreover, some non-sequence specific, poorly characterized and/or unconventional RBDs are currently not included (e.g. dsRBD, G-patch, zinc-knuckle and zinc-ribbon) (7). Inclusion of additional domains and species is a future objective for RBPDB, and users can suggest novel domains for inclusion (see Future Directions section). We note, however, that in eukaryotes, the repertory of known and predicted RBPs is dominated by RRM and KH domains, and as such, these constitute the majority of experimental data in RBPDB.
A short text description of the RBDs in the largest isoform of the protein (e.g. RRMx2 for a protein with two RRM domains) was assigned, and links to UniProt were added where available. In addition, in order to facilitate comparison between the RNA-binding specificities of similar proteins in different organisms, we imported orthology relationships from InParanoid (28).
During the course of curation, when we encountered RNA-binding experiments for proteins in other species (such as Xenopus, yeast or rat), we added them to the database on an ad hoc basis. However, coverage of the RNA-binding proteomes of species other than human, mouse, Drosophila and C. elegans is not intended to be comprehensive.
We populated RBPDB with RNA-binding data by searching PubMed with the gene names and aliases of the aforementioned RBPs, and recording any RNA-binding data found in the retrieved papers. RBPDB currently catalogs 14 types of RNA-binding experiments. These include experiments that measure binding to a single sequence and those that measure binding to many sequences in parallel, in vivo or in vitro. A description of the categories of experiments and the number of experiments in each category is given in Table 2.
Single-sequence experiments were included when the sequence of the bound RNA could be determined and was less than 200 nt in length. For these experiments, the full nucleotide sequence is included, unless a consensus motif rather than a unique sequence is reported. Consensus sequences use IUPAC (International Union of Pure and Applied Chemistry) nomenclature for degenerate nucleotides. Additionally, sequences with variable-length stretches or repetitive motifs are reported as (M)(X), where M is the repeated nucleotide or sequence, and X is a numerical value or range, or ‘n’ to denote a long stretch of undefined length. For example, the motif CUCUCU(A)(15–30)CUCUCU described for PTB contains two CUCUCU sequences separated by 15–30 adenosines (29), while (G)(n) denotes a poly(G) sequence.
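The (M)(X) notation described above can be turned into a regular expression for matching against candidate sequences. The following is a minimal sketch, not part of RBPDB itself: the function names are hypothetical, and reading ‘n’ as "one or more repeats" is an assumption made for illustration.

```python
import re

# IUPAC codes for degenerate RNA nucleotides, as regex character classes.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}

# One "(M)(X)" repeat unit, e.g. "(A)(15–30)" or "(G)(n)".
REPEAT = re.compile(r"\(([A-Z]+)\)\(([^)]+)\)")

def iupac_to_regex(seq):
    """Translate a plain IUPAC string into a regex fragment."""
    return "".join(IUPAC_RNA[base] for base in seq)

def motif_to_regex(motif):
    """Compile a consensus motif with optional (M)(X) repeats (hypothetical helper)."""
    parts = []
    pos = 0
    for m in REPEAT.finditer(motif):
        parts.append(iupac_to_regex(motif[pos:m.start()]))
        unit = iupac_to_regex(m.group(1))
        count = m.group(2).replace("–", "-")  # accept an en dash in ranges
        if count == "n":
            quant = "+"  # assumption: 'n' means one or more repeats
        elif "-" in count:
            low, high = count.split("-")
            quant = "{%d,%d}" % (int(low), int(high))
        else:
            quant = "{%d}" % int(count)
        parts.append("(?:%s)%s" % (unit, quant))
        pos = m.end()
    parts.append(iupac_to_regex(motif[pos:]))
    return re.compile("".join(parts))

ptb = motif_to_regex("CUCUCU(A)(15–30)CUCUCU")
print(bool(ptb.fullmatch("CUCUCU" + "A" * 20 + "CUCUCU")))
```

Using a non-capturing group for the repeated unit keeps multi-nucleotide repeats intact when the quantifier is applied.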
For SELEX experiments, we extracted the selected sequences from the publication and aligned them as reported. We then created a position frequency matrix (PFM) from the alignment, and calculated a PWM using the Transcription Factor Binding Site (TFBS) package (30). Logos were created using the WebLogo standalone package (31). Reported motifs that contained internal gaps that would preclude representation in matrix format, or those for which >10% of the selected sequences do not match the reported motif, are reported as an IUPAC consensus motif only, as described above.
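The conversion from an alignment to a PFM and then to a PWM can be illustrated with a short sketch. This is not the TFBS package's implementation: the pseudocount, the uniform background of 0.25 and the function names are assumptions chosen for illustration, and the input is assumed to be an ungapped alignment of equal-length sequences.

```python
import math

ALPHABET = "ACGU"

def build_pfm(aligned):
    """Count nucleotides per column of an ungapped alignment."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    pfm = {base: [0] * length for base in ALPHABET}
    for seq in aligned:
        # Accept DNA-style input by mapping T to U.
        for i, base in enumerate(seq.upper().replace("T", "U")):
            pfm[base][i] += 1
    return pfm

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    length = len(pfm["A"])
    pwm = {base: [0.0] * length for base in ALPHABET}
    for i in range(length):
        total = sum(pfm[base][i] for base in ALPHABET)
        for base in ALPHABET:
            # The pseudocount is split according to the background frequency.
            prob = (pfm[base][i] + pseudocount * background) / (total + pseudocount)
            pwm[base][i] = math.log2(prob / background)
    return pwm

# Toy alignment for illustration only, not data from RBPDB.
pfm = build_pfm(["UAUUUA", "UAUUUA", "UUUUUA", "UAUUUG"])
pwm = pfm_to_pwm(pfm)
print(pfm["U"])
```

The pseudocount prevents a nucleotide that was never observed at a position from receiving a score of negative infinity.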
When possible, we compiled all sequences identified in large-scale in vivo binding experiments. There is considerable diversity in how these data and sequences are reported and annotated. In some cases, we were unable to recover sequences; in these cases, RBPDB refers to the original publication but does not contain the sequences. When we were able to recover bound sequences, we included a short README file to describe how the sequences were extracted from supplementary data or GEO (Gene Expression Omnibus) (32). In general, when bound sequences were detected by tiling arrays, we extracted the genomic sequence within ±200 bp of each reported peak, taken from the sense strand with respect to the annotated gene (genomic rather than transcript sequence, since it is possible that pre-mRNA is bound), and recorded it along with any numerical value associated with the peak (e.g. log ratio intensity). When only the identity of bound genes or transcripts is reported, we compiled the transcript or gene sequence retrieved from GenBank using BioPerl (33), or from batch download files from FlyBase, and reported this sequence along with its associated numerical value. These studies used a variety of normalization and reporting strategies; wherever possible, we report only normalized data rather than raw data, but we capture any associated GEO or ArrayExpress (34) identifiers so that users can access the data directly. When there are multiple samples or controls, we report each separately. In some cases, matrices or sequence logos were reported for genome-wide in vivo immunoprecipitation experiments, and are included in the database.
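The peak-window extraction described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name is hypothetical, the peak is treated as a single 0-based coordinate, and the chromosome is assumed to be available as a plain forward-strand DNA string.

```python
# Complement table for DNA, including N for unknown bases.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def peak_window(chrom_seq, peak_pos, strand, flank=200):
    """Return the sequence within +/- flank of a peak, on the gene's sense strand.

    chrom_seq: forward-strand chromosome sequence (assumed input format)
    peak_pos:  0-based peak coordinate (assumption)
    strand:    '+' or '-' strand of the annotated gene
    """
    start = max(0, peak_pos - flank)                   # clip at chromosome start
    end = min(len(chrom_seq), peak_pos + flank + 1)    # clip at chromosome end
    window = chrom_seq[start:end].upper()
    if strand == "-":
        # Reverse-complement so the result reads 5' to 3' on the sense strand.
        window = window.translate(COMPLEMENT)[::-1]
    elif strand != "+":
        raise ValueError("strand must be '+' or '-'")
    return window

# Toy sequence with a small flank so the result is easy to check by eye.
print(peak_window("AAACCCGGGTTT", 5, "+", flank=2))
```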
RBDs recognize specific RNA sequences, structures or both. RNA binding in vivo is presumably dependent on a combination of factors, including accessibility of the binding site (35) and interactions with cofactors (including other RBPs). A goal of RBPDB is to describe bound sequences with minimal interpretation, which is at odds with the difficulty of representing and storing RNA structure in a compact, unambiguous, computer- and human-readable format. For example, minimum free energy structures require a windowing function to select the region of RNA to fold and are too simple to represent suboptimal structures, which can be biologically functional. Therefore, in RBPDB we include only a yes/no indication of whether the original manuscripts discussed the secondary structure of the RNA. Users interested in predicting structure should consider the RNAfold webserver (among others) (36).
There are three main modes of interaction with RBPDB. The first is to search for RNA-binding experiments by RBP, by RBD, by species, by experiment type or by any combination of the above. The second is to perform bulk downloads of all RBPDB data or subsets of the data filtered in various ways. The third is to scan an input RNA sequence for potential binding sites for RBPs stored in RBPDB.
RBPDB can be searched quickly by gene name, alias or description, by entering a search term in the search box on the home page or at the top of every page. More complex queries can be executed using the advanced search form, reached by clicking the ‘advanced’ link. From here, the proteins database can be searched by gene name or symbol, organism, or RBDs by making the appropriate selections on the form. To retrieve experiment records directly, the experiments form should be used; it takes the same input, with the addition of options to search by experiment type. Figure 1 shows the results from one such search. From the results page, experimental data can be viewed and exported. Any results table can also be further filtered by partial text matches in any of the columns by clicking ‘Filter’. Columns can be sorted in decreasing or increasing order by clicking the column label.
There are two ways to download data from RBPDB. First, the annotation data corresponding to a subset of proteins or experiments resulting from a search query can be exported in plain text, comma-separated values (CSV), Excel or Word formats directly from a search result table, as shown in Figure 1. The second way to download data is via the Downloads page, linked from the menu at the top of the site (Figure 2). This page has links to files that include the full annotation database in SQL, tab-delimited and CSV formats, as well as sets of transcripts bound in genome-wide in vivo experiments, and binding specificity PFM and PWM matrices in a flat text file format (30). The individual protein and experiment tables are also available, as well as the linker table needed to map experiments to proteins. These files are also available for each species separately.
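As an illustration of how a linker table maps experiments to proteins, the sketch below joins three tables in an in-memory SQLite database. All table names, column names and rows are hypothetical placeholders; the actual RBPDB download files may use a different schema, which should be checked before adapting this.

```python
import sqlite3

def build_demo_db():
    """Create a toy database with placeholder rows (not RBPDB data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE protein (prot_id INTEGER PRIMARY KEY, gene_name TEXT, species TEXT);
        CREATE TABLE experiment (exp_id INTEGER PRIMARY KEY, exp_type TEXT, pubmed_id TEXT);
        CREATE TABLE prot_exp (prot_id INTEGER, exp_id INTEGER);  -- linker table
    """)
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                     [(1, "GENE_A", "species_1"), (2, "GENE_B", "species_2")])
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)",
                     [(10, "type_x", "PMID_1"), (11, "type_y", "PMID_2")])
    # One experiment can involve several proteins, and vice versa.
    conn.executemany("INSERT INTO prot_exp VALUES (?, ?)",
                     [(1, 10), (1, 11), (2, 11)])
    return conn

def experiments_for(conn, gene_name):
    """List experiments linked to a gene through the linker table."""
    rows = conn.execute("""
        SELECT e.exp_id, e.exp_type, e.pubmed_id
        FROM protein p
        JOIN prot_exp pe ON pe.prot_id = p.prot_id
        JOIN experiment e ON e.exp_id = pe.exp_id
        WHERE p.gene_name = ?
        ORDER BY e.exp_id
    """, (gene_name,))
    return rows.fetchall()

conn = build_demo_db()
print(experiments_for(conn, "GENE_A"))
```

A separate linker table is the usual way to represent a many-to-many relationship, which is why both the protein and experiment tables are needed alongside it.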
From the main page, users can submit a nucleotide sequence to scan for matches to RBP-binding sites. The sequence can be in DNA or RNA format, and a threshold for reporting matches can be set. At present, the sequence can only be scanned with motifs that have full PWMs. Each candidate site in the sequence is scored against each PWM using BioPerl (33). The PWM score for a candidate site is the sum, over all positions in the PWM, of the score of the nucleotide observed at that position, and the relative score is this score expressed as a percentage of the maximum possible score for that PWM. Sites with relative scores greater than the threshold, which defaults to 80%, are reported. Figure 3 shows the results obtained for the 3′-UTR of the human c-fos gene. The RBPs TTP and members of the ELAV family have been implicated in the AU-rich element (ARE)-regulated degradation of c-fos RNA (37). The top hits are to the known ARE-binding proteins ELAVL2 (HuB) and ZFP36 (TTP).
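The scoring scheme described above can be sketched in a few lines. This follows the description in the text (relative score as a percentage of the maximum possible PWM score) and is not the BioPerl code used by the site; the function name and the toy PWM are hypothetical.

```python
def scan(sequence, pwm, threshold=80.0):
    """Return (start, site, relative_score) for sites scoring above threshold.

    pwm maps each of A, C, G, U to a list of per-position scores.
    """
    seq = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    width = len(pwm["A"])
    # Maximum possible score: best nucleotide at every position.
    max_score = sum(max(pwm[b][i] for b in "ACGU") for i in range(width))
    if max_score <= 0:
        raise ValueError("PWM must have a positive maximum score")
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(base not in pwm for base in site):
            continue  # skip windows containing ambiguous characters
        score = sum(pwm[base][i] for i, base in enumerate(site))
        relative = 100.0 * score / max_score
        if relative > threshold:
            hits.append((start, site, round(relative, 1)))
    return hits

# Toy 3-nt PWM favouring UAU, for illustration only.
TOY_PWM = {
    "A": [-1.0, 2.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "U": [2.0, -1.0, 2.0],
}
print(scan("GGTATCC", TOY_PWM))
```

Other conventions rescale the score between the minimum and maximum possible PWM scores instead; the choice affects which sites pass a given threshold, so the 80% default here applies only to the definition given in the text.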
It is also possible to search all individual RNA sequences from the single-sequence experiments by entering a sequence or IUPAC consensus of interest in the search window. The search will return exact matches to the text entered.
We will periodically update RBPDB to keep it current. Each protein entry in our database will be reassessed at least once a year. RBPDB also has a user submission form that allows users to notify our curators of recent publications describing RNA-binding specificities or newly discovered RBPs; we will prioritize these submissions for updates. Newly described RBDs [e.g. the nudix domain (38)] and newly described RBPs without conserved domains will be included using the search strategy used for the initial construction of the database. A related future direction for RBPDB will be the systematic incorporation of data from other species. RBPDB is currently populated only with data from metazoans, which are of special interest for biomedical research but represent only a small minority of eukaryotes. There is RNA-binding information for proteins in other species, particularly traditional non-metazoan model systems such as yeast (39) and Arabidopsis [e.g. (40)], and also bacteria.
It may also be possible to further populate the database by inferring RNA-binding activities. While a universal molecular ‘code’ that predicts RNA sequence specificity directly from protein sequence has proven difficult to derive (25), there is little question that proteins with very similar amino-acid sequences tend to have very similar RNA-binding activities. As such, we anticipate that one application of RBPDB will be further analysis of the relationships between protein sequences and RNA-binding activities. For these analyses, it would be invaluable to document the RNA-binding activities of individual RBDs, rather than of whole proteins, and to align the bound sequences where possible. Indeed, the way the RNA-binding activity is represented is critical for many uses of RBPDB, including genome scanning, identification of proteins that would bind sequences of interest, and comparisons among RBPs. Therefore, an area of ongoing exploration will be the representation of RNA-binding activities, including the inclusion of domain-specific information and incorporation of RNA structure.
Canadian Institutes of Health Research (MOP-93671 to T.R.H. and Q.M.; MOP-49451 to T.R.H.); National Institutes of Health (1R01HG00570 to T.R.H.); Natural Sciences and Engineering Research Council of Canada CGS-M (to K.C.). Funding for open access charge: Canadian Institutes of Health Research.
Conflict of interest statement. None declared.
The authors are grateful to Harm van Bakel, Debashish Ray and Carl de Boer for computational support and helpful conversations.