The database has grown dramatically over the past two years: from 25 families annotating around 55 000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280 000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. While we cannot therefore represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database. Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can be used as detectors for a wide range of environmental conditions [e.g. bacterial riboswitches bind a range of metabolites as reviewed previously (7), and the 5′-UTR of PrfA acts as a temperature-dependent switch (9)] to regulate message stability or translational efficiency.
This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above—‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user to quickly identify all snoRNA gene families for instance, or to find all riboswitches in the database.
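As a rough illustration of how a type field can serve as an entry point for browsing, the sketch below filters a small list of families by type. Everything in it is an assumption made for illustration: the `Family` record, the accessions, the subtype terms and the semicolon-separated type string are hypothetical and are not taken from the database schema; only the three top-level types come from the text above.

```python
from collections import Counter
from typing import NamedTuple


class Family(NamedTuple):
    accession: str  # hypothetical identifier, not a real accession
    type_path: str  # assumed format: "TopLevel; subtype; ..."


def top_level(family: Family) -> str:
    """Return the top-level type: 'Gene', 'Intron' or 'Cis-reg'."""
    return family.type_path.split(";")[0].strip()


def has_type(family: Family, term: str) -> bool:
    """True if `term` matches any level of the family's type path."""
    terms = [t.strip().lower() for t in family.type_path.split(";")]
    return term.lower() in terms


# Made-up example records; the subtype names are illustrative only.
FAMILIES = [
    Family("FAM0001", "Gene; snoRNA; CD-box"),
    Family("FAM0002", "Gene; miRNA"),
    Family("FAM0003", "Cis-reg; riboswitch"),
    Family("FAM0004", "Cis-reg; thermoregulator"),
    Family("FAM0005", "Intron"),
    Family("FAM0006", "Gene; snoRNA; HACA-box"),
]

# "Identify all snoRNA gene families" and "find all riboswitches".
snornas = [f.accession for f in FAMILIES if has_type(f, "snoRNA")]
riboswitches = [f.accession for f in FAMILIES if has_type(f, "riboswitch")]

# Number of families under each top-level type.
counts = Counter(top_level(f) for f in FAMILIES)

print(snornas)
print(riboswitches)
print(dict(counts))
```

Matching a term at any level of the path means a single query can retrieve a whole class (all 'Gene' families) or a narrow subtype. Applied to the full release, the same per-type tally should reproduce the figures quoted above, 308 + 69 + 2 = 379 families.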
One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes [e.g. Caenorhabditis briggsae, chicken (11) and Erwinia carotovora]. In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.