miRBase is the central online repository for microRNA (miRNA) nomenclature, sequence data, annotation and target prediction. The current release (10.0) contains 5071 miRNA loci from 58 species, expressing 5922 distinct mature miRNA sequences: a growth of over 2000 sequences in the past 2 years. miRBase provides a range of data to facilitate studies of miRNA genomics: all miRNAs are mapped to their genomic coordinates. Clusters of miRNA sequences in the genome are highlighted, and can be defined and retrieved with any inter-miRNA distance. The overlap of miRNA sequences with annotated transcripts, both protein- and non-coding, is described. Finally, graphical views of the locations of a wide range of genomic features in model organisms allow for the first time the prediction of the likely boundaries of many miRNA primary transcripts. miRBase is available at http://microrna.sanger.ac.uk/.
MicroRNAs (miRNAs) are short RNA sequences expressed from longer transcripts encoded in animal, plant and virus genomes, and recently discovered in a single-celled eukaryote (1,2). miRNAs regulate the expression of target genes by binding to complementary sites in their transcripts to cause translational repression or transcript degradation (3). Translational repression is thought to be the primary mechanism for imperfect target duplexes in animals, with transcript degradation the dominant mechanism for largely perfect matches found throughout plant target transcripts. miRNAs have been implicated in processes and pathways such as development, cell proliferation, apoptosis, metabolism and morphogenesis, and in diseases including cancer (4,5).
miRBase is the primary repository and database resource for miRNA data. The database has three main functions:
The miRNA nomenclature scheme has been presented and discussed previously (6,8,9). Novel miRNAs require cloning or expression evidence, and should be submitted only after a manuscript describing their identification is accepted for publication. Assigned names should then be incorporated into the final version of the manuscript prior to publication. Obvious homologues of miRNAs validated in closely related species need not be experimentally verified and may be submitted at any time. Primary features of the nomenclature scheme are:
However, it is important to note that a short name cannot always encode complex information such as orthology and paralogy relationships. In some cases, the short name is a pragmatic choice that is the most consistent of conflicting representations of these sequence relationships. While the names provide a guide of family and function, they should not therefore be relied upon to confer any complex meaning. Instead, dedicated fields in the database provide information about gene and mature miRNA sequence families.
The published miRNA literature is huge. Readers are referred to a number of comprehensive reviews of miRNA structure, biogenesis and function (4,10–12). Here, we focus on specific issues and points of interest with respect to the provision of miRNA data in the miRBase database.
The number of miRNA hairpin loci in the miRBase database continues to grow rapidly, from 2909 in 36 genomes (June 2005, release 7.0) to 5071 in 58 genomes (August 2007, release 10.0) in the past 2 years. The number of miRNAs in a genome has been the subject of much discussion in the literature. Early estimates of the number of miRNAs in the worm and human genomes were put at 123 and 255, respectively (13,14). However, these estimates were based largely on conservation studies. It is now clear that many miRNAs may be clade- or even organism-specific. A number of recent large-scale studies have lifted the number of miRNA loci known in human to 533 (Table 1) (15–17), around 60% of which are obviously conserved in mouse (miRBase release 10.0).
The 5071 miRNA hairpin loci in the database express 4922 dominant mature miRNA (miR) products (Table 1). In many cases, deep sequencing technologies have detected large numbers of miR* sequences: biogenesis byproducts that are often detected at very low levels and are likely non-functional. Starting in miRBase release 10.0, mature miR and miR* sequences are better distinguished in the database, and distributed in separate release files. Mature miRNAs from both the 5′ and 3′ arms of the hairpin precursor are frequently identified, suggesting either that both may be functional or that there are insufficient data to determine the predominant product. Such miRNAs are given names of the form hsa-miR-140-5p and hsa-miR-140-3p, and both are retained in the miR set. Often, subsequent improved data allow one product to be chosen and annotated as the dominant miR. Recent data updates have occasionally caused the annotation of a miR and miR* pair to be reversed.
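The arm-specific names described above can be split into their parts mechanically. The following is a minimal sketch, not part of miRBase: the function name `parse_mature_name`, the regular expression, and the use of a trailing asterisk to mark a miR* product are assumptions made for illustration, and only the simple name forms discussed in this section are handled.

```python
import re
from typing import NamedTuple, Optional


class MatureName(NamedTuple):
    species: str        # species prefix, e.g. "hsa"
    number: str         # numeric identifier, with optional letter suffix
    arm: Optional[str]  # "5p", "3p", or None when no arm is given
    star: bool          # True when the name carries a trailing "*"


# Hypothetical pattern covering names such as hsa-miR-140-5p.
_NAME = re.compile(
    r"^(?P<species>[a-z]+)-miR-(?P<number>\d+[a-z]?)"
    r"(?:-(?P<arm>5p|3p)|(?P<star>\*))?$"
)


def parse_mature_name(name: str) -> MatureName:
    """Split a mature miRNA name into species, number, arm and star flag."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised mature miRNA name: {name!r}")
    return MatureName(
        species=match["species"],
        number=match["number"],
        arm=match["arm"],
        star=match["star"] is not None,
    )
```

Names that do not fit the assumed pattern raise a `ValueError` rather than being guessed at, which matches the caution above that names should not be relied upon to carry complex meaning.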
Increasingly deep and comprehensive cloning and sequencing studies identify many mature miRNAs with variable 3′ (and, to a lesser extent, 5′) ends [see for example (17)]. The miRNAs in the database currently represent the consensus of the most dominantly expressed sequence. As more data become available, the ends of mature miRNAs in the database will be adjusted to reflect the most up-to-date consensus information. We also aim to provide specific data on the distribution of ends in future releases. All changes in name and sequence between releases are specifically described in the diff file on the FTP site, along with all data from previous releases.
Usually, the only available experimental data support the mature miRNAs; hairpin precursors are very rarely experimentally validated. Rather, the precursors are the result of computational prediction of hairpin structures that include the mature miRNA. When a number of loci include the same mature miRNA, we cannot usually say with confidence which loci are actually expressed. In addition, the extents of the hairpins depicted in the database are somewhat arbitrary: the approximate extent of the predicted hairpin structure is shown. Formally, this includes the true precursor (the product of DROSHA cleavage) and a small amount of flanking sequence. Future developments will include the provision to retrieve the precursor with user-defined lengths of flanking sequence. Of the 5922 mature miRNA products in the database, 3685 are validated experimentally in the originating organism; the remainder are obvious homologues of validated miRNAs from a related species (Table 1). The ‘evidence’ field describes the origin of each sequence in the database.
The miRBase::Targets database uses the miRanda algorithm (7) to predict targets in untranslated regions (UTRs) of 37 animal genomes from Ensembl (18). The quality of the predictions has recently benefited from significantly improved 3′UTR information, based on DITAG and 5′CAGE data, available from Ensembl. The number of human and mouse transcripts without an experimentally supported 3′UTR (for which we search a region 2 kb downstream) has therefore dropped significantly in the latest release (v5). A number of validated miR/target pairs have been shown to have mismatches in the so-called ‘seed’ region (19). The miRBase/miRanda pipeline is therefore not constrained by the requirement for exact ‘seed’ matches. Recent papers have also highlighted the importance of secondary features for miRNA/target recognition, such as sequence accessibility, AU bias and UTR position (20,21). We intend to incorporate these features into the miRBase::Targets prediction pipeline over the coming 12 months. In addition, links are provided to other target prediction sites and algorithms, and to the TarBase database of experimentally supported targets (22).
Recently, we have focused on the provision of tools to distribute miRNA genomic information.
Where an assembled genome sequence is available, coordinates of all miRNAs are provided: in summary tables for each organism and miRNA family, on each miRNA entry page, and for bulk download in GFF format. Links are provided from each coordinate to the appropriate genome browsers.
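For readers working with the bulk coordinate download, the sketch below shows one way to read GFF records into simple tuples. It relies only on the general GFF convention of nine tab-separated columns; the exact layout of the attributes column in the miRBase files is not described here, so it is kept as an unparsed string. The names `Feature` and `parse_gff` are illustrative assumptions.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    seqid: str       # chromosome or contig name (column 1)
    type: str        # feature type (column 3)
    start: int       # 1-based inclusive start (column 4)
    end: int         # 1-based inclusive end (column 5)
    strand: str      # "+" or "-" (column 7)
    attributes: str  # raw attributes column, left unparsed (column 9)


def parse_gff(lines: Iterable[str]) -> List[Feature]:
    """Parse GFF lines, skipping blank lines and '#' comment lines."""
    features = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        features.append(
            Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], cols[8])
        )
    return features
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.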
40–70% of vertebrate miRNAs appear to be expressed from introns of protein- and non-coding transcripts (Table 1) (23). In worms and flies, intronic miRNAs are less common (15% and 39%, respectively, in protein-coding genes), and only 5–10% of Arabidopsis miRNAs overlap annotated transcripts. For all animals with Ensembl-annotated genome assemblies, we provide a list of transcripts overlapping each miRNA, with overlap type (intron, exon and UTR), and sense (forward and reverse strands).
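The overlap types listed above can be illustrated with a small interval test. This is a sketch under stated assumptions rather than the method used to build the miRBase tables: the `Transcript` structure, the `classify_overlap` function, and the rule that an exonic hit lying entirely outside the coding span is reported as UTR are all hypothetical simplifications, and strand agreement is reported as a simple boolean.

```python
from typing import List, NamedTuple, Optional, Tuple


class Transcript(NamedTuple):
    strand: str                     # "+" or "-"
    exons: List[Tuple[int, int]]    # 1-based inclusive exon intervals
    cds: Optional[Tuple[int, int]]  # genomic span of the coding region, or None


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def classify_overlap(
    mir_start: int, mir_end: int, mir_strand: str, tx: Transcript
) -> Optional[Tuple[str, bool]]:
    """Return (overlap type, same strand?) or None if there is no overlap."""
    tx_start = min(start for start, _ in tx.exons)
    tx_end = max(end for _, end in tx.exons)
    if not overlaps(mir_start, mir_end, tx_start, tx_end):
        return None
    same_strand = mir_strand == tx.strand
    in_exon = any(overlaps(mir_start, mir_end, s, e) for s, e in tx.exons)
    if not in_exon:
        # Inside the transcript span but between exons.
        return ("intron", same_strand)
    if tx.cds is not None and not overlaps(mir_start, mir_end, *tx.cds):
        # Exonic, but entirely outside the coding span of a coding transcript.
        return ("UTR", same_strand)
    return ("exon", same_strand)
```

A miRNA that straddles an exon boundary is reported as exonic here; a production tool would probably want to report such mixed cases separately.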
miRNAs are often clustered close together in the genome. This clustering has been suggested as evidence that more than one miRNA may be expressed from the same primary miRNA transcript (pri-miRNA). Furthermore, known ‘polycistronic’ miRNA transcripts have been shown to be long: up to tens of kilobases in mammals. Over 40% of human miRNAs, over 30% of worm and fly miRNAs and only around 10% of Arabidopsis miRNAs are within 10 kb of another miRNA (Table 1). miRBase provides a list of clustered miRNAs on each applicable entry page. In addition, a new search facility allows the user to retrieve clusters of miRNAs in any organism separated by any choice of distance.
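Clustering by a user-chosen inter-miRNA distance can be sketched as a single pass over loci sorted by position. The code below is an illustration, not the miRBase implementation: the names `Locus` and `cluster_loci` are hypothetical, the distance is assumed to be measured from the furthest end reached so far in a cluster to the start of the next locus, and strand is ignored.

```python
from itertools import groupby
from typing import Iterable, List, NamedTuple


class Locus(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based inclusive
    end: int


def cluster_loci(loci: Iterable[Locus], max_distance: int) -> List[List[Locus]]:
    """Group loci on the same chromosome whose gaps are <= max_distance.

    Only groups containing two or more loci are returned.
    """
    clusters: List[List[Locus]] = []
    ordered = sorted(loci, key=lambda locus: (locus.chrom, locus.start))
    for _, group in groupby(ordered, key=lambda locus: locus.chrom):
        current: List[Locus] = []
        reach = 0  # furthest end coordinate seen in the current cluster
        for locus in group:
            if current and locus.start - reach > max_distance:
                # Gap too large: close the current cluster and start a new one.
                if len(current) > 1:
                    clusters.append(current)
                current = []
            reach = locus.end if not current else max(reach, locus.end)
            current.append(locus)
        if len(current) > 1:
            clusters.append(current)
    return clusters
```

Calling `cluster_loci(loci, 10_000)` corresponds to the 10 kb distance quoted above; any other distance can be passed in the same way.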
While the mapping of mature and hairpin miRNA sequences to assembled genomes is readily available in miRBase, the extents of only very few primary miRNA transcripts (pri-miRNA) are determined and annotated. For intronic miRNAs, the pri-miRNA is assumed to be the protein- (or non-)coding host transcript. Information about the extents of intergenic pri-miRNAs can be inferred from collective analysis of genomic features such as transcription start sites (TSS), CpG islands, EST and cDNA overlap, DITAG and 5′CAGE data, transcription factor binding sites (TFBS) and polyadenylation site predictions (polyA). A detailed analysis of these data suggests that pri-miRNA transcripts vary in length from a few hundred bases up to tens of kilobases (24). We have recently developed a tool to visualize the relative positions of these predictions and mappings with respect to annotated miRNA genes and clusters. Careful inspection of these data allows the prediction of the 5′ and 3′ boundaries of a significant number of putative pri-miRNAs. For example, Figure 1 shows TSSs, CpG island, ESTs, cDNAs, DITAG (172B22 and 172B221) and polyA site predictions surrounding mmu-mir-135b on mouse chromosome 1, which support a primary transcript of length around 15 kb with 5′ and 3′ ends ~7–8 kb upstream and downstream of the miRNA. Links from each miRNA entry page provide a tabulated list of features overlapping flanking regions of the miRNA with their corresponding coordinates and scores, and a graphical view of the features present in the miRNA gene neighbourhood (as in Figure 1). These views are currently available for human, mouse, rat, worm and fly miRNAs, and will be extended to other organisms in the future. For human, mouse and rat genomes, TSSs are predicted using the Eponine-TSS software (25) at a threshold of 0.990. Drosophila TSS predictions, together with CpG islands, ESTs, cDNAs, repeats and DITAGs for all species are obtained from Ensembl.
TFBSs in the flanking regions of human miRNAs are obtained from the conserved TFBS track of the UCSC genome browser (26). Other TFBS data are imported from the regulatory features track of Ensembl. PolyA signals are predicted in-house using the DNAFSMiner method (27) with a cutoff score of 0.6. The ‘Genomics’ section of the miRBase site allows the user to specify flanking and clustering distances, and the range of features desired.
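The tabulated view of features in the flanking regions of a miRNA amounts to a score filter followed by a window query. The sketch below illustrates that idea only: the `FlankFeature` structure, the `features_in_flank` function and the feature kind labels are hypothetical, while the two score cutoffs are the ones quoted in the text (0.990 for TSS predictions and 0.6 for polyA predictions). Feature kinds without a stated cutoff are not filtered by score.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class FlankFeature(NamedTuple):
    kind: str     # e.g. "TSS", "polyA", "CpG" (labels are illustrative)
    start: int    # 1-based inclusive
    end: int
    score: float


# Cutoffs quoted in the text; other kinds have no cutoff in this sketch.
MIN_SCORE = {"TSS": 0.990, "polyA": 0.6}


def features_in_flank(
    mir_start: int,
    mir_end: int,
    features: Iterable[FlankFeature],
    flank: int,
    kinds: Optional[Set[str]] = None,
) -> List[FlankFeature]:
    """Return features passing their cutoff that overlap the flanked miRNA."""
    window_start = mir_start - flank
    window_end = mir_end + flank
    kept = []
    for feature in features:
        if kinds is not None and feature.kind not in kinds:
            continue
        if feature.score < MIN_SCORE.get(feature.kind, float("-inf")):
            continue
        if feature.start <= window_end and feature.end >= window_start:
            kept.append(feature)
    return sorted(kept, key=lambda feature: feature.start)
```

Restricting `kinds` mirrors the option to choose the range of features desired, and `flank` plays the role of the user-specified flanking distance.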
miRBase is available on the web at http://microrna.sanger.ac.uk/. All data are available for download from the FTP site (ftp://ftp.sanger.ac.uk/pub/mirbase/) in a variety of formats including FASTA sequences and MySQL relational database dumps.
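Downloaded FASTA files can be read with a few lines of code. The generator below is a minimal sketch that assumes only the standard FASTA layout (a header line starting with ">" followed by one or more sequence lines); it takes the first whitespace-delimited token of the header as the identifier and makes no assumption about the remaining header fields. The name `read_fasta` is illustrative.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            # Sequence may be wrapped over several lines; join them later.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Because the function is a generator, large files can be streamed record by record, for example with `for name, seq in read_fasta(open(path)):` where `path` points to a local copy of a release file.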
S.G.-J. is funded by the University of Manchester. H.K.S. holds a GlaxoSmithKline postdoctoral fellowship, and work at the Sanger Institute is funded by the Wellcome Trust. Funding to pay the Open Access publication charges for this article was provided by the University of Manchester.
Conflict of interest statement. None declared.