RNAdb is a comprehensive database of mammalian non-protein-coding RNAs (ncRNAs). There is increasing recognition that ncRNAs play important regulatory roles in multicellular organisms, and there is an expanding rate of discovery of novel ncRNAs as well as an increasing allocation of function. In this update to RNAdb, we provide nucleotide sequences and annotations for tens of thousands of non-housekeeping ncRNAs, including a wide range of mammalian microRNAs, small nucleolar RNAs and larger mRNA-like ncRNAs. Some of these have documented functions and/or expression patterns, but the majority remain of unclear significance, and include PIWI-interacting RNAs, ncRNAs identified from the latest rounds of large-scale cDNA sequencing projects, putative antisense transcripts, as well as ncRNAs predicted on the basis of structural features and alignments. Improvements to the database comprise not only new and updated ncRNA datasets, but also provision of microarray-based expression data and closer interface with more specialized ncRNA resources such as miRBase and snoRNA-LBME-db. To access RNAdb, visit http://research.imb.uq.edu.au/RNAdb.
The mammalian genome encodes thousands of non-protein-coding RNAs (ncRNAs). Ribosomal RNAs (rRNAs), transfer RNAs (tRNAs) and small nuclear RNAs (snRNAs) fulfil mainly housekeeping roles in mRNA translation and splicing. Small nucleolar RNAs (snoRNAs) and the related small Cajal body-specific RNAs (scaRNAs) guide modifications of other RNAs (1–3). MicroRNAs (miRNAs) regulate gene expression by controlling mRNA translation and turnover (4,5), PIWI-interacting RNAs (piRNAs) are thought to be important in spermatogenesis (6,7), while larger ncRNAs have been discovered to be developmentally regulated (8–10) and to function in a range of processes including genomic imprinting, intracellular protein trafficking and brain development (11–16). The abundance of ncRNAs has only become apparent in the past few years and was largely unexpected. Although many recently identified ncRNAs remain of unknown function, and appear to be evolving rapidly (15,17), it is increasingly clear that ncRNAs represent a diverse and important class of functional output from mammalian genomes.
RNAdb is a comprehensive database of mammalian ncRNAs. The focus of the database is on ncRNAs that have restricted expression and whose function is likely to be regulatory. Housekeeping RNAs (rRNAs, tRNAs, snRNAs) are not included and are covered elsewhere (18,19). The aim of the database is to provide a nucleotide sequence-based platform to facilitate both bioinformatic and experimental research in the burgeoning field of RNomics. Already, RNAdb has been used to develop machine-learning algorithms for identifying ncRNAs (20), annotate transcripts from a large-scale transcriptome project (8) and examine ncRNA evolution (17). In addition to containing sequence data, individual ncRNA entries in RNAdb are annotated based upon publicly available information in the literature or secondary databases. In this way, the database can also be browsed or searched by the casual user interested in learning more about particular ncRNAs.
Since the original release of RNAdb two years ago (21), the number of known mammalian ncRNAs has grown considerably. In recognition of this growth, we have updated the database to include tens of thousands of novel ncRNAs. Some of these have been characterized in isolation, continuing the trend of ad hoc discovery by which many earlier ncRNAs were identified. The majority, however, comes from large-scale cloning and sequencing studies or structural alignment-based predictions. As well as incorporating new ncRNA datasets, the current release of RNAdb provides other enhancements, including microarray-based expression data, closer interface with specialized ncRNA resources such as miRBase and snoRNA-LBME-db (3,22), and the availability of data for use as custom tracks on the UCSC Genome Browser (23).
RNAdb is available on-line at http://research.imb.uq.edu.au/RNAdb. Currently, datasets are stored in relational form in a Microsoft SQL Server 2005 database. The web application is multilayered, with the primary presentation layer implemented in C# 2.0 under the ASP.NET 2.0 framework. The application layer is implemented as a mixture of C# and C++ modules, with dataset normalization performed through SQL stored procedures.
The database can be accessed or queried in various ways. Users can casually browse the collection. Specific searches can be performed using keywords (with or without Boolean operators) and/or by applying filters across nominated fields. BLAST searches permit users to locate regions of similarity between sequences of interest and those stored in the database.
To facilitate links with other on-line resources, users can now directly go to a detailed view of an entry by using the following URL and substituting the RNAdb unique identifier of interest for <RNAdbID>: http://jsm-research.imb.uq.edu.au/rnadb/default.aspx?ncrna=<RNAdbID>. For example, if a user wishes to look at the detailed view for MIR1004, one would use http://jsm-research.imb.uq.edu.au/rnadb/default.aspx?ncrna=MIR1004.
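For resources that wish to link to RNAdb programmatically, the URL scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `rnadb_entry_url` and the decision to percent-encode the identifier are our assumptions, not part of the RNAdb service itself.

```python
from urllib.parse import quote

# Base address of the detailed entry view, as given in the text above.
RNADB_ENTRY_BASE = "http://jsm-research.imb.uq.edu.au/rnadb/default.aspx"


def rnadb_entry_url(rnadb_id: str) -> str:
    """Return the detailed-view URL for one RNAdb unique identifier.

    Hypothetical helper: the name and the input checks are illustrative only.
    """
    rnadb_id = rnadb_id.strip()
    if not rnadb_id:
        raise ValueError("an RNAdb identifier is required")
    # Percent-encode the identifier so that unusual characters cannot
    # break the query string.
    return f"{RNADB_ENTRY_BASE}?ncrna={quote(rnadb_id, safe='')}"


print(rnadb_entry_url("MIR1004"))
```

For the identifier MIR1004, this reproduces the example URL given in the text.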
The entire database is available for download in either FASTA or XML format via the website. Specific datasets are also provided as custom tracks for loading directly into the UCSC Genome Browser (23). This feature allows users to easily take a defined subset of the database and readily apply it to the UCSC Genome Browser's extensive set of comparative and analytical tools.
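As a minimal sketch of how a downloaded FASTA file might be loaded for local analysis, the following Python function reads FASTA text into a dictionary. The layout of RNAdb header lines is not specified here, so taking the first whitespace-delimited token as the identifier is an assumption, and the two records in the example are invented.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}.

    Assumption: the identifier is the first whitespace-delimited token of
    each header line; the real download may use a different layout.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped; join them and normalise case.
            records[current] += line.upper()
    return records


# Invented records, for illustration only.
example = """>EXAMPLE1 hypothetical entry
acgtacgt
ACGT
>EXAMPLE2
ggcc
"""
records = read_fasta(example.splitlines())
print(records)
```

Because the function accepts any iterable of lines, an open file handle for the downloaded file can be passed in directly in place of `example.splitlines()`.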
ncRNAs in RNAdb are divided into several distinct datasets (Table 1). This decision was made not only to reflect the different ways in which ncRNAs have been identified, but also in recognition that users may want to separately query and download each set. A description of each dataset is provided below.
Over 1800 mammalian miRNAs are found within RNAdb. These sequences were obtained from the latest release of miRBase (release 8.2, July 2006) (22). miRBase is the central repository for miRNA data on the web and is regularly maintained. We have elected to directly link the RNAdb miRNA entries to miRBase, so as to keep abreast with the most recent annotations and updates.
RNAdb contains more than 500 mammalian snoRNAs and scaRNAs. The snoRNAs fall into two general classes, C/D box and H/ACA snoRNAs, which classically guide ribose methylation and pseudo-uridylation of rRNAs, respectively. Interestingly, some snoRNAs appear to regulate other RNAs, including HBII-52 which regulates the alternative splicing of the serotonin receptor 2C and is implicated in the pathogenesis of Prader-Willi syndrome (24). Human snoRNAs and scaRNAs in RNAdb were derived from snoRNA-LBME-db (release 3, August 2006) (3), and annotations for these sequences are maintained by linking out to this informative and specialized resource.
The PIWI family of proteins is known to be important for germ cell development. PIWI proteins were recently discovered to bind thousands of small RNAs, termed piRNAs (6,7). piRNAs, which have been identified in testis, are 26–31 nt in length and are distinct from miRNAs. Over 88 000 piRNA candidates have been cloned and sequenced from mouse, human and rat, and are included for the first time in the current release of RNAdb.
This dataset contains more than 900 unique ncRNA sequences which have been identified and manually curated based upon extensive literature review. A majority of ncRNAs listed here are much longer than those listed above. Altogether, 36 mammalian organisms are represented but most ncRNAs are from either mouse or human. Although some of these transcripts have documented biological roles, most are transcripts of unknown function. As well as sequence data, additional information (including GenBank accessions, references, chromosomal location, transcript length, splicing status, conservation notes, function, disease associations, antisense relationships, imprinting status and tissue expression patterns) is provided wherever possible in separate searchable fields.
New additions to this dataset include multiple long mRNA-like ncRNAs whose functions have recently become apparent. For instance, NRON, an ncRNA repressor of the nuclear factor of activated T cells (NFAT), regulates nuclear trafficking of NFAT (13). Taurine upregulated gene 1 (TUG1) is required for photoreceptor development in the eye (25). Saf, which lies antisense to the death receptor Fas, alters the expression of alternative Fas isoforms and increases resistance to Fas-induced apoptosis (26). Evf-2, an ncRNA derived from an ultraconserved element, cooperates with the homeodomain protein Dlx-2 to augment the transcriptional activity of a nearby enhancer (27). Serving as a salient reminder that many ncRNAs are poorly conserved (17) is HAR1F, which is expressed specifically in Cajal-Retzius neurons in the developing human neocortex and has evolved rapidly in the human lineage (15).
Using full-length cDNA cloning and sequencing strategies, the Functional Annotation of Mouse (FANTOM) project has identified thousands of novel transcripts from the mouse genome (8). In the most recent round of annotation, 34 030 cDNAs were manually annotated as putative ncRNAs (28), a subset of which were subsequently shown to be derived fragments of very long ncRNAs (29). Since both cloning and manual human annotation are subject to variation and error, the true number of ncRNAs remains unclear. To this end, we provide the results of various computational prediction strategies for use as additional filters in identifying ncRNAs (Supplementary Data 1). In addition to sequence data, details such as the Riken clone identifier, GenBank accession, genomic location, transcript length, likely imprinting status and library of origin are provided. RNAdb also incorporates expression information from publicly available microarray datasets such as GNF SymAtlas (30) (Supplementary Data 2). Although limited to only a small proportion of FANTOM3 ncRNAs, this information allows the identification of transcripts that are dynamically expressed across various tissues and cell types, and is expected to provide a useful starting point for their further characterization.
As indicated earlier, the vast majority of ncRNAs identified from large-scale cDNA sequencing projects are of unknown significance. A recent screen of several hundred, well-conserved FANTOM ncRNAs identified not only NRON, but also seven other functional ncRNA genes essential for cell viability or involved in Hedgehog signalling (13,31). Given that this strategy employed only a limited number of cell-based assays and that only a tiny proportion of ncRNAs were examined, it would appear likely that many more functional ncRNAs from the FANTOM collection remain to be uncovered in the future.
This dataset contains more than 1700 putative ncRNAs from the latest round of the Human Full-length cDNA Annotation Invitational (H-Invitational) project (release 3.4, August 2006) (32). Non-protein-coding transcripts are defined in this dataset by the absence of any open reading frame and by not belonging to the pseudogene classification. In addition to the sequence data, details such as the GenBank accession no., genomic location, transcript length, library of origin and expression data (based upon publicly available microarray data where present; see Supplementary Data 2) are also listed.
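To illustrate the kind of open reading frame test that such a definition implies, the sketch below measures the longest ATG-to-stop reading frame on the forward strand of a transcript. It is not the H-Invitational procedure, whose details are not described here; the function names and the `min_codons` threshold of 100 are hypothetical choices made purely for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Length in codons (start included, stop excluded) of the longest
    ATG-to-stop open reading frame in the three forward frames.

    Reading frames that run off the end without a stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def lacks_long_orf(seq: str, min_codons: int = 100) -> bool:
    """True if no forward ORF reaches min_codons (hypothetical threshold)."""
    return longest_orf_codons(seq) < min_codons


print(longest_orf_codons("ATGAAATAA"))
```

A threshold is needed in practice because almost any transcript contains short reading frames by chance; a fuller treatment would also consider reading frames that lack an in-frame stop codon.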
Recently, a number of studies have identified thousands of putative ncRNAs based upon predicted structural features and alignments using novel comparative genomics tools. The datasets resulting from three independent approaches, RNAz (33), Non-coding RNA Search (34) and EvoFold (35), are included here. RNAz combines a comparative approach (scoring conservation of secondary structure) with the observation that ncRNAs are thermodynamically more stable than expected by chance. Using sequences conserved in at least human, mouse, rat and dog, over 35 000 structured elements were identified in the human genome (36). Non-coding RNA Search uses syntenic regions between human and mouse that are unalignable and then utilizes the FOLDALIGN algorithm to identify regions with conserved secondary structure. Finally, EvoFold utilizes a comparative genomics method based on phylogenetic stochastic context-free grammars to identify functional RNAs. Using an eight-way genome-wide alignment of human, chimpanzee, mouse, rat, dog, chicken, zebrafish and Fugu, over 47 000 candidate RNA structures were identified in the human genome.
Natural antisense transcription is now recognized as being a common occurrence in the mammalian transcriptome and a means by which gene expression can be regulated (37–39). Data from tiling array experiments and sequencing of short tags representing 5′ and 3′ ends of transcripts suggest that more than 60% of all human and mouse loci may be transcribed on both strands and give rise to complementary transcripts (9,37). In its original release, RNAdb contained a dataset of putative antisense ncRNAs identified from cDNA and EST databases for human and mouse using a computational pipeline (21). Coinciding with the current release, we have recently re-developed the pipeline that searches for antisense RNAs and experimentally validated a subset of its predictions (40). We will continue to use the improved pipeline in regular updates of the antisense ncRNA dataset.
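The core idea behind detecting antisense pairs, namely transcripts that map to opposite strands of overlapping genomic regions, can be sketched as follows. This is a simplified illustration and not the pipeline described in (21) or (40): it compares whole genomic spans rather than individual exons, and the `Transcript` class, the function names and all coordinates are invented for the example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Transcript:
    name: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def is_antisense_pair(a: Transcript, b: Transcript) -> bool:
    """True if a and b lie on opposite strands of the same chromosome
    and their genomic spans share at least one base."""
    return (
        a.chrom == b.chrom
        and {a.strand, b.strand} == {"+", "-"}
        and a.start < b.end
        and b.start < a.end
    )


def antisense_pairs(transcripts: Iterable[Transcript]) -> List[Tuple[str, str]]:
    """Names of all overlapping opposite-strand pairs (quadratic scan)."""
    items = list(transcripts)
    return [
        (a.name, b.name)
        for a, b in combinations(items, 2)
        if is_antisense_pair(a, b)
    ]


# Invented coordinates, for illustration only.
demo = [
    Transcript("geneA", "chr1", 100, 500, "+"),
    Transcript("asA", "chr1", 400, 900, "-"),
    Transcript("geneB", "chr1", 450, 700, "+"),
    Transcript("geneC", "chr2", 400, 900, "-"),
]
print(antisense_pairs(demo))
```

The pairwise scan is adequate for small sets; for genome-scale data, sorting the transcripts by position or using an interval index would avoid comparing every pair.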
The total number of mammalian ncRNA sequences contained in RNAdb has increased ~10-fold since the database's inception 2 years ago. Such growth in content reflects the high level of interest and activity in the field over this period. Nevertheless, most of the newly added sequences represent putative ncRNAs and their biological roles, if any, remain unclear. The fact that most of the mammalian genome is transcribed suggests that there is either a great deal of transcriptional noise or that these RNAs are fulfilling some unexpected functions in mammalian biology (11,41). Although the search for new ncRNAs is far from exhausted (42), one of today's principal challenges in the field of RNomics is to explore the functional significance of this abundant non-coding transcription. If this challenge is to be successfully met in coming years, then experimental proof of function will be paramount. Such proof often comes slowly and incrementally, and having bioinformatic resources such as RNAdb will be essential to guide and facilitate future discovery.
As new ncRNAs are discovered, we will continue to update RNAdb. Submissions of new mammalian ncRNAs are invited, and should be sent to RNAdb/at/imb.uq.edu.au. We also plan to regularly synchronize our datasets with miRBase and snoRNA-LBME-db, as these resources are updated. Currently, publicly available microarray-based expression data for ncRNAs remain limited, but are likely to be significantly expanded in the future (K. Pang and M. Dinger, unpublished data). Once new expression data are released, they will subsequently be incorporated into RNAdb.
The authors would like to thank Martin Frith and Jinfeng Liu for assistance with annotation of the FANTOM3 ncRNAs, Evgeny Nudler for providing HSR-1 sequences, Jhumku Kohtz for providing EVF1 and EVF2 sequences, and Tracy Young and Constance Cepko for providing TUG1 sequences. This work was supported by funding from the Australian National Health & Medical Research Council (K.C.P), the Foundation for Research, Science and Technology, New Zealand (M.E.D), the Australian Research Council, the Queensland State Government and the University of Queensland (J.S.M.), and the Functional Genomics Programme (FUGE) of the Research Council of Norway (P.G.E. and B.L.). Funding to pay the Open Access publication charges for this article was provided by the University of Queensland.
Conflict of interest statement. None declared.