Since the pioneering studies of Fox and Woese (1) 30 years ago, later pursued by Pace, Olsen, Giovannoni, and Ward (2–5), the ribosomal RNA (rRNA) molecule has been established as the ‘gold standard’ for investigating the phylogeny and ecology of microorganisms (6, 7). Today, the more than 500 000 publicly available small and large subunit (SSU and LSU) rRNA sequences call for specialized, quality-controlled databases and appropriate software tools. In anticipation of this deluge of rRNA data, the development of the ARB software suite and the curation of its associated databases began more than 12 years ago (8). The software suite offers a graphical user interface and a wide variety of interacting software tools built around a common database. Furthermore, the ARB project provides structured, integrative knowledge databases for small and large subunit rRNAs. Based on regularly offered international workshops and the ARB mailing list, it is currently estimated that the ARB software suite and its databases are used worldwide by several thousand users from academia and industry. In addition to the ARB approach, three projects currently offer access to a set of curated ribosomal RNA sequence and alignment databases: the European rRNA databank at the University of Gent (http://www.psb.ugent.be/rRNA/) (9), the Ribosomal Database Project II (http://rdp.cme.msu.edu/) at Michigan State University in East Lansing, MI (10, 11), and the greengenes project (http://greengenes.lbl.gov/) maintained by the Lawrence Berkeley National Laboratory in Berkeley, CA (12). All four projects offer at least one 16S rRNA dataset, but they vary in the number of sequences, quality checks, alignments, and update procedures. However, the ARB project is the only platform that actively incorporates homologous small (SSU) as well as large (LSU) subunit sequences from all three domains of life: the Bacteria and Archaea (16S/23S) and the Eukarya (18S/28S). All projects provide web-based software tools for the alignment and classification of sequences as well as probe match functionality. Sequences can be downloaded in various formats, including the commonly used FASTA and GenBank file formats. Additionally, greengenes provides ARB-compatible datasets, but only for nearly full-length sequences (>1250 bases) of Bacteria and Archaea.
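To illustrate what a length cutoff such as the >1250 bases mentioned above means in practice, the following minimal Python sketch separates FASTA records into nearly full-length and partial sequences. It is not taken from any of the projects named here: the function names, the gap characters assumed for aligned sequences (`-` and `.`), and the example records are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Threshold quoted in the text for the greengenes ARB-compatible datasets.
NEAR_FULL_LENGTH_MIN = 1250


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {identifier: sequence}."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def base_count(seq: str) -> int:
    """Count bases, ignoring gap characters (assumed here to be '-' and '.')."""
    return sum(1 for ch in seq if ch not in "-.")


def split_by_length(
    records: Dict[str, str], min_bases: int = NEAR_FULL_LENGTH_MIN
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (nearly_full_length, partial); a record needs more than min_bases."""
    near_full = {n: s for n, s in records.items() if base_count(s) > min_bases}
    partial = {n: s for n, s in records.items() if n not in near_full}
    return near_full, partial


# Hypothetical example input: one long and one short record.
example = [">seqA 16S rRNA", "ACGU" * 400, ">seqB", "ACGU" * 100, "--.."]
near_full, partial = split_by_length(parse_fasta(example))
```

Keeping the partial records in a separate collection rather than discarding them reflects a design choice: a database that applies the cutoff only as a filter loses the short sequences entirely.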
An increasing awareness of, and motivation for, cataloguing and protecting the biodiversity on Earth using molecular techniques demand comprehensive knowledge databases spanning all three domains of life. Furthermore, the majority of the available sequences are derived from cultivation-independent biodiversity surveys, which rely on rapid pattern- or clone-based approaches that often generate partial rRNA sequences. To conserve this suboptimal information, especially for diversity studies, state-of-the-art databases need to retain partial sequences.
To compensate for the limited phylogenetic resolution of the SSU rRNA (13, 14), the twofold larger LSU rRNA should now also be included in the rRNA approach (6). Especially for eukaryotes, the highly variable regions in the LSU rRNA are already commonly used for species discrimination (15). Triggered by a new capacity for cheap and rapid sequencing, there is a steady flow of approximately 10 000 rRNA sequences per month into the public databases of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org). To make full use of these data for reliable phylogenetic reconstructions and biodiversity analysis, careful inspection of each sequence and alignment is necessary. To support users in this task, standardized procedures for assigning a defined set of contextual information to each sequence must be established. Unified quality control mechanisms are urgently needed to intuitively flag potential problems with each sequence. The recent introduction of accelerated and less expensive sequencing technologies, such as pyrosequencing (16), and their successful application in a census of marine microbial diversity (17), further substantiate the need for comprehensive, quality-controlled databases for comparisons. Such databases provide a stable framework that enables biologists to turn the copious data into reliable biological knowledge. The SILVA database project is designed to meet the demand for comprehensive, quality-controlled and aligned rRNA datasets. It is intended to provide a central knowledge resource that relieves users of the time-consuming manual curation process.