The Rfam database aims to catalogue non-coding RNAs through the use of sequence alignments and statistical profile models known as covariance models. In this contribution, we discuss the pros and cons of using the online encyclopedia, Wikipedia, as a source of community-derived annotation. We discuss the addition of groupings of related RNA families into clans and new developments to the website. Rfam is available on the Web at http://rfam.sanger.ac.uk.
The Rfam database maintains alignments, consensus secondary structures, covariance models (CMs) and corresponding annotation for RNA families. Each family represents a set of RNA sequences that function at the RNA level and share a clear common ancestor. Some examples are tRNA, microRNAs, spliceosomal RNAs, riboswitches, CRISPR elements and thermosensors. The primary purpose of the Rfam database is the automated, accurate annotation of non-coding RNAs (ncRNAs) in genomic sequences. Rfam is also frequently used as a source of high-quality alignments for training and benchmarking RNA sequence analysis software tools (1–5). Additionally, in the absence of a well-curated and up-to-date general RNA sequence database, equivalent to UniProt in the protein coding world, Rfam is also often used as a source of individual ncRNA sequences.
As described in previous Rfam publications, the database is built upon well-curated seed alignments of representative members of an RNA family (6–8). These are used to build CMs, statistical models of a family's conserved sequence and secondary structure, using the Infernal suite of analysis tools (9). The resultant covariance models are used to scan a large database of nucleotide sequences that is derived from the EMBL nucleotide archive (10). The searches return a list of putative homologs, or hits, ranked by bit-scores derived from the CMs. A hit's bit-score is the log odds ratio of the probability the hit was generated by the CM versus a random model of background sequence. An expert curator provides a threshold that in their opinion best discriminates between bona fide homologs to the seed sequences and the background distribution of false hits. Subsequently, all sequences with a bit-score above the threshold are included in an automatically generated alignment to the CM.
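The scoring and thresholding steps described above can be illustrated with a minimal sketch. It is not Infernal's implementation: the function names, hit names, probabilities and threshold are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math


def bit_score(p_model: float, p_null: float) -> float:
    """Log-odds score in bits: log2 of P(hit | CM) over P(hit | null model)."""
    return math.log2(p_model / p_null)


def hits_above_threshold(hits, threshold):
    """Keep (name, score) pairs at or above the threshold, best score first.

    Whether a score exactly equal to the threshold is kept is an assumption.
    """
    kept = [(name, score) for name, score in hits if score >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Hypothetical hits: names and probabilities are invented for illustration.
hits = [
    ("seqC", bit_score(2**-48, 2**-50)),
    ("seqA", bit_score(2**-10, 2**-50)),
    ("seqB", bit_score(2**-30, 2**-50)),
]

# A curator-chosen threshold (hypothetical value) separates homologs from noise.
selected = hits_above_threshold(hits, threshold=15.0)
for name, score in selected:
    print(f"{name}\t{score:.1f} bits")
```

In practice the probabilities come from the CM and the background model rather than being supplied directly, and the threshold is set per family by a curator.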
In order to keep Rfam as up-to-date as possible we aim to make regular releases of the database. These releases are snapshots of the live, internal version of the database that are made publicly available via the websites and ftp. We have two types of release. A major release (indicated by an integer and a ‘.0’ in the version number e.g. ‘10.0’) usually involves updating the underlying sequence database, Rfamseq, to the latest version of EMBL and remapping all the seed sequences to the new database. All the families are subsequently searched against the new database and, if necessary, re-thresholded. Minor releases are indicated by ‘.1’, ‘.2’, etc. in the version number e.g. ‘10.1’. These are usually made after adding many new families to the database built on the same underlying sequence database.
Rfam 10.0 was released in early 2010. This release included a major update to the underlying search algorithm, switching to a new version of Infernal, v1.0 (9). This required individually re-thresholding each Rfam family due to an important change in Infernal’s underlying scoring scheme from maximum likelihood alignment scores to summed scores over all possible alignments [i.e. switching from using the CYK algorithm to the Inside algorithm (11)]. Additionally, the new version of Infernal reports estimates of the statistical significance of hits (E-values) returned from database searches using Rfam 10.0 CM files. We also mapped all the families to, and searched, a new version of Rfamseq based on EMBL 100 (10). These and other internal improvements to our pipeline resulted in a 178% increase in the number of regions that Rfam covers, in contrast to the more modest 40% increase in the size of Rfamseq. This has caused some of our alignments to become very large. For example, the tRNA full alignment now contains more than 1 million sequences. The amount of compute required for this release was roughly 5 CPU months to calibrate the models, 1 CPU year to run blast, 3 CPU years to run CM-searches (cmsearch) and 15 CPU days to produce CM-derived multiple sequence alignments (cmalign).
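As a rough back-of-the-envelope check, the figures quoted above can be combined as follows. The sketch assumes that a "178% increase" means a 2.78-fold total, and converts months and days using 12 months and 365 days per year; the variable names are illustrative only.

```python
# Percentage increases quoted in the text for Rfam 10.0.
region_increase_pct = 178.0   # regions covered by Rfam
rfamseq_increase_pct = 40.0   # size of Rfamseq

# Convert percentage increases to fold changes (assumption: X% increase = 1 + X/100).
region_fold = 1 + region_increase_pct / 100
rfamseq_fold = 1 + rfamseq_increase_pct / 100

# Growth in annotated regions relative to growth of the sequence database.
density_fold = region_fold / rfamseq_fold

# Compute cost per stage, expressed in CPU years.
cpu_years = {
    "calibration": 5 / 12,   # 5 CPU months
    "blast": 1.0,            # 1 CPU year
    "cmsearch": 3.0,         # 3 CPU years
    "cmalign": 15 / 365,     # 15 CPU days
}
total_cpu_years = sum(cpu_years.values())

print(f"Regions per unit of sequence grew about {density_fold:.2f}-fold")
print(f"Total compute: about {total_cpu_years:.2f} CPU years")
for stage, years in cpu_years.items():
    print(f"  {stage}: {100 * years / total_cpu_years:.0f}% of total")
```

Under these assumptions, the number of annotated regions per unit of sequence roughly doubled, and the CM searches account for most of the compute.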
One of the fundamental problems facing any biocuration effort is keeping the annotation of the entities stored in a database up to date with the current literature. Typically, the annotation of existing entries changes less quickly than new data are added, so entries become rapidly out-of-date.
In mid-2007, Rfam began experimenting with using Wikipedia as a means for storing and curating the textual annotation of RNA families. Three years on, the RNA family pages have received more than 9000 edits from more than 1000 unique users. Slightly over 1% of these edits have been recognized as possible vandalism (Figure 1). The resulting marked-up annotation and curated references have dramatically improved the content of the Rfam database compared with the pre-2007 static text. The Wikipedia entries also help drive users to the Rfam website. Approximately 15% of all the web-traffic to http://rfam.sanger.ac.uk now comes via Wikipedia. As has been observed by others, a typical Google search for a biological term returns a Wikipedia entry among the top hits (12,13). From a curator’s viewpoint, Wikipedia is an excellent model to take advantage of as it includes a large community of contributors and comes with a number of user-friendly tools that help with basic editing, maintaining references and automated updates to pages with programs called bots. The large community also has other benefits, such as the well documented long-tail effect, where the majority of new content is added by a large number of editors, each of whom makes just a few edits (12,13). There are also dedicated editors who attend to small but important details that an average curator may not have time for, such as consistency of style, grammar and spelling. Other editors are dedicated to reverting obvious non-constructive edits, commonly referred to as ‘vandalism’, which are usually recognized and reverted within seconds. It is important to note that all edits are reviewed before appearing on the Rfam website, so no overt vandalism reaches Rfam. Given our positive experiences, we can highly recommend that other curation efforts turn to Wikipedia for their annotation.
However, it must be borne in mind that Wikipedia is built by consensus and to gain its benefits you will lose the tight control of the data allowed by in-house curation.
One of the fundamental quality control steps that Rfam employs is that no two families can annotate the same nucleotide. This rule prevents us building two or more families for essentially the same entity. When building new Rfam families or extending an existing family, we sometimes find ourselves artificially increasing the threshold to avoid overlaps with another family or trimming the ends of families that have incorrect boundaries. We also find that a single alignment may not capture all the diversity of a group of homologous RNAs. To resolve some of these issues, we have borrowed the concept of a clan from the MEROPS and Pfam databases (14,15).
We have added 99 clans for the Rfam 10.0 release. These clans describe explicit relationships between families that either clearly share a common ancestor but are too divergent to be reasonably aligned, or could be aligned but have clearly distinct functions and therefore should be kept as separate families. For example, the RNase P clan contains five homologous families: RNase MRP, archaeal RNase P, nuclear RNase P and the bacterial RNase P types a and b. These RNAs are ribozymes involved in processing of pre-tRNA and pre-rRNA sequences. The RNase Ps are, however, notoriously difficult to align to each other. Furthermore, RNase P and RNase MRP are functionally distinct molecules (16). Another clan of interest is Glm; this clan contains two homologous but functionally distinct bacterial small RNAs, GlmY and GlmZ, which act in a hierarchical fashion to regulate the translation of the glmS coding gene. GlmY activates expression of GlmZ which in turn de-sequesters the GlmS Shine-Dalgarno sequence via an anti-antisense interaction (17). The new clans mean that some of the internal quality control measures that Rfam uses can be relaxed for the clanned families. Primarily, this means we can ignore our no-overlap rule within a clan; in the past this rule forced some of these families to have artificially high thresholds to avoid overlapping a related but distinct family.
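The no-overlap rule and its relaxation within clans can be sketched as below. This is an illustrative simplification and not Rfam's pipeline code: the `Hit` type, the coordinates, the sequence identifiers and the family `hypothetical_family_X` are invented, and strand is ignored.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    family: str
    seq_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def regions_overlap(a: Hit, b: Hit) -> bool:
    """True if both hits share at least one nucleotide on the same sequence."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


def forbidden_overlaps(hits: List[Hit], clan_of: Dict[str, str]) -> List[Tuple[Hit, Hit]]:
    """Return overlapping hit pairs from different families that do not share a clan."""
    violations = []
    for i, a in enumerate(hits):
        for b in hits[i + 1:]:
            if a.family == b.family or not regions_overlap(a, b):
                continue
            clan_a = clan_of.get(a.family)
            if clan_a is not None and clan_a == clan_of.get(b.family):
                continue  # same clan: the overlap is tolerated
            violations.append((a, b))
    return violations


# GlmY and GlmZ belong to the Glm clan (from the text); coordinates are made up.
clan_of = {"GlmY": "Glm", "GlmZ": "Glm"}
hits = [
    Hit("GlmY", "seq1", 100, 250),
    Hit("GlmZ", "seq1", 200, 380),                  # overlaps GlmY, same clan
    Hit("tRNA", "seq2", 10, 82),
    Hit("hypothetical_family_X", "seq2", 50, 120),  # overlaps tRNA, no clan
]

violations = forbidden_overlaps(hits, clan_of)
for a, b in violations:
    print(f"{a.family} and {b.family} overlap on {a.seq_id}")
```

The pairwise comparison is quadratic in the number of hits; at the scale of a real sequence database, sorting hits by sequence and start position would be the natural refinement.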
In order to help assess the likelihood of a relationship between two or more families, we used a number of independent lines of evidence. These included sequence analysis based upon a SCOOP-like analysis for comparing overlapping hits from both profile hidden Markov model (HMM) and covariance model searches (18), the profile-profile comparison tool PRC (19) and literature searches for functional and evolutionary relationships. For the snoRNA and miRNA families, we were able to utilize some additional sources of information in order to establish homology. For the snoRNAs, we used some of the specialized snoRNA databases to confirm whether families targeted orthologous regions of rRNA; for many snoRNAs this helped to confirm a relationship between the families (20–23). For the miRNAs, we used the annotated seed region of the mature miRNA (24). If two or more miRNA families shared a significant amount of similarity in the seed region, and if they had further similarities identified by the sequence analysis tools, then these too were added to clans.
The new seed and full alignments available via the website use descriptive species labels for sequence names rather than the more cryptic EMBL accessions and coordinates that were previously provided. The provenance of the sequence data is maintained by using ‘#=GS’ tags from Stockholm format (25) to provide a mapping back to EMBL accessions (Figure 2). Stockholm is a versatile markup format for biological sequence alignments. It allows the markup of general file information, including references, comments and cross-links. It also allows regions of an alignment that cannot be aligned to be marked with tildes in the ‘#=GC RF’ lines.
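A mapping from species labels back to accessions could be recovered from such a file along the following lines. This is a minimal sketch, not an official parser: it assumes the mapping is carried in the standard ‘#=GS <seqname> AC <accession>’ form, and the sequence names, accessions and alignment in the example are hypothetical.

```python
def parse_gs_accessions(stockholm_text: str) -> dict:
    """Map each sequence name to the value of its '#=GS <name> AC <value>' line."""
    mapping = {}
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GS"):
            continue
        # Fields: '#=GS', sequence name, feature tag, free text.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2] == "AC":
            mapping[parts[1]] = parts[3].strip()
    return mapping


# Hypothetical alignment: labels, accessions and sequences are invented.
EXAMPLE = """# STOCKHOLM 1.0
#=GF ID   hypothetical_family
#=GS Species_one.1 AC HYPOTHETICAL01.1/100-110
#=GS Species_two.1 AC HYPOTHETICAL02.1/5-15
Species_one.1      GGCUAGCUAGC
Species_two.1      GGCUUGCAAGC
#=GC SS_cons       <<<.....>>>
//
"""

accessions = parse_gs_accessions(EXAMPLE)
for label, accession in accessions.items():
    print(f"{label} -> {accession}")
```

Other per-sequence ‘#=GS’ features and the per-file and per-column annotation lines are ignored here; a full Stockholm parser would handle them as well.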
An important feature for any biocuration effort is linking to related resources, for example, primary sequence databases, genomes and specialized resources such as miRBase and the snoRNA databases. Recently, a number of groups have started developing controlled vocabularies for describing biological entities. Two efforts of particular relevance to Rfam are the sequence ontology (SO) and the gene ontology (GO) (26,27). For the majority of Rfam families, we have now added cross-links to both the SO and the GO. Many of these were provided by researchers at the functional RNA database (28). In the near future, we plan to introduce more ncRNA terms back into the ontologies. Until then the mapping will remain rather coarse-grained and closely related to the existing types Rfam uses as annotation (6). This mapping groups the RNAs into three main groups: ‘cis-reg’, ‘gene’ and ‘intron’ with subtypes such as ‘riboswitch’, ‘miRNA’ and ‘snoRNA’.
For the forthcoming minor release of Rfam, we have added a number of new and notable families. Of particular note are the direct submissions of Stockholm formatted alignments and corresponding Wikipedia articles from the RNA community via the RNA families track at RNA Biology (8). This track has relieved our curators of much of the burden of building these new families, and because the families are built and annotated by experts they are of high quality. Updated families from this route include RNase MRP, SRP, tmRNA and the U3 snoRNA (29–32). In addition, several families missing from past Rfam releases have been published, including the SmY RNA, the cyanobacterial RNA Yfr2, several Trypanosomatid snoRNAs, the self-splicing ribozyme GIR1, an influenza pseudoknot, the Staphylococcus small RNA RsaOG and a putative RNA antitoxin, ptaRNA1 (33–39). The ptaRNA1 article alerted us to the fact that Rfam contains none of the published and well-characterized RNA antitoxins such as sok and symE (40). These omissions will be remedied in Rfam 10.1. Environmental sensors are a growing class of cis-regulatory elements. These are generally structured 5′ UTR elements that change conformation in response to environmental changes such as temperature or pH; this change subsequently influences the expression of the protein encoded in the host mRNA. We have added the first examples of a cold sensor and a pH sensor (41,42). Finally, we have received a large number of submissions from a recent bioinformatic screen that was followed by a thorough analysis of the predictions, largely based upon genomic context. This has resulted in more than 80 new additions to the database (43). Fortunately, the authors kindly provided both Stockholm formatted alignments and Wikipedia articles for these new families.
A pressing issue for Rfam is the replacement of WU-BLAST as a pre-filter for searching the Rfamseq database. The legal rights to up-to-date versions of WU-BLAST were recently acquired by a commercial entity and the software can no longer be considered free in any meaningful sense. However, there have been several developments that should allow profile HMMs to be used as effective pre-filters for covariance model searches (44). Accelerated profile HMM searches are now available through the HMMER package (45–47). In the near future, Rfam will therefore be in a position to replace the current BLAST-based filters with accelerated profile HMMs.
Sequencing projects such as the Genome 10K (48) and other attempts to fill sequencing gaps in the tree of life (49) mean that most Rfam families will dramatically increase in depth in the near future. Large alignments already pose a considerable challenge when it comes to displaying or distributing the alignments themselves, or building and displaying related data such as species and phylogenetic trees. Novel techniques will need to be developed in order to deal with these and many other issues of scale. We look forward to working with the wider community to develop these new tools and techniques.
Wellcome Trust (grant number WT077044/Z/05/Z) (to P.P.G., J.D., J.T., I.H.O., B.M. and A.B.); Howard Hughes Medical Institute (R.D.F., E.P.N., D.L.K. and S.R.E.); University of Manchester (S.G.J.). Funding for open access charge: The Wellcome Trust (grant number WT077044/Z/05/Z).
Conflict of interest statement. None declared.
Many thanks to Guy Coates, James Beal and Peter Clapham for assistance with improving the performance of computational and software infrastructure. The authors received invaluable feedback at the 2009 Benasque RNA Workshop.