RNRdb is implemented using a relational database with an HTML user interface, and is available at http://rnrdb.molbio.su.se. The different proteins in RNRdb are denoted Nrd with appropriate suffixes (Tables , ), according to the common nomenclature for the corresponding genes in bacteria and archaea; when applicable, synonymous names are specified for each entry. NrdA and NrdB denote the components of the class Ia RNR, where NrdA contains the active site region and binding sites for allosteric effectors, and NrdB carries the stable tyrosyl radical. NrdB proteins in which a Mn(IV) metal centre takes over the role of the tyrosyl radical [7] are denoted NrdBPhe. The RNR components of class Ib are NrdE, with the active site and allosteric binding regions, and NrdF, with the stable tyrosyl radical. In addition, class Ib operons code for NrdI, a flavoprotein, and often NrdH, a specific physiological reductant for the class Ib RNRs [9]. Class II RNRs are denoted NrdJ. The class III RNR proper is denoted NrdD. This RNR requires a specific activase, an iron-sulphur protein denoted NrdG [13] that belongs to the radical SAM protein family [14]. A majority of bacteria, and some archaea, encode the global regulator NrdR [15] that controls the transcription/translation of different RNR genes [16] (Table ).
Distribution of RNR proteins
The database was initially populated using manually collected and curated data, and is expanded and maintained as follows. Profile hidden Markov models (HMMs) [18] are generated, using HMMER [19], from alignments of known RNR sequences in the database, representing the known sequence diversity for each RNR protein. Candidate protein sequences are then retrieved from GenBank using HMMER. Candidate sequences are filtered for duplicates and manually checked before incorporation into RNRdb (Fig. ). Manual curation is performed by aligning candidates to known, experimentally annotated RNR sequences (a procedure for which there is theoretical precedent [20]), to ensure that only full-length sequences possessing all key sequence motifs are inserted. NrdH and NrdG candidates pose special problems: NrdH because it is shorter than 100 residues and has striking similarities to thioredoxins/glutaredoxins [9], and NrdG because it is a member of the highly conserved radical SAM family [14] and is often confused with pyruvate formate-lyase activating enzyme. For this reason, an NrdH candidate is only accepted if it is located close to other class Ib members (NrdE, NrdF, or NrdI). Likewise, an NrdG candidate is only accepted if it is located close to NrdD or, when this criterion is not applicable, only the highest-scoring NrdG candidate is accepted for an organism with an existing NrdD entry.
RNRdb pipeline. RNRdb is loaded from upstream databases (see text for details) using a semi-automated pipeline. Before inclusion, each sequence is manually vetted.
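The automated part of the screening described above (duplicate filtering and the neighbourhood rules for NrdH and NrdG) can be illustrated with a minimal Python sketch. It is not the RNRdb implementation: the `Candidate` record, the function names and the `MAX_GENE_DISTANCE` cutoff are hypothetical, since the text only states that candidates must be "located close to" their partners, and the fallback for NrdG is our reading of "when this criterion is not applicable". Manual curation is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff: the text only says "located close to".
MAX_GENE_DISTANCE = 5


@dataclass(frozen=True)
class Candidate:
    organism: str
    protein: str                      # e.g. "NrdH", "NrdG", "NrdD"
    sequence: str
    score: float                      # profile HMM score of the hit
    gene_index: Optional[int] = None  # gene order on the genome, if known


def drop_duplicates(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the first hit for each (organism, sequence) pair."""
    seen = set()
    unique = []
    for cand in candidates:
        key = (cand.organism, cand.sequence)
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


def is_near(cand: Candidate, partners: List[Candidate]) -> bool:
    """True if a partner from the same organism lies within the cutoff."""
    if cand.gene_index is None:
        return False
    return any(
        p.organism == cand.organism
        and p.gene_index is not None
        and abs(p.gene_index - cand.gene_index) <= MAX_GENE_DISTANCE
        for p in partners
    )


def accept_nrdh(cand: Candidate, known: List[Candidate]) -> bool:
    """Accept an NrdH candidate only next to NrdE, NrdF or NrdI."""
    partners = [k for k in known if k.protein in {"NrdE", "NrdF", "NrdI"}]
    return is_near(cand, partners)


def select_nrdg(candidates: List[Candidate],
                known: List[Candidate]) -> List[Candidate]:
    """Accept NrdG hits near NrdD; otherwise fall back to the best-scoring
    hit, but only for organisms that already have an NrdD entry."""
    nrdd = [k for k in known if k.protein == "NrdD"]
    accepted = []
    for organism in sorted({k.organism for k in nrdd}):
        hits = [c for c in candidates
                if c.organism == organism and c.protein == "NrdG"]
        near = [c for c in hits if is_near(c, nrdd)]
        if near:
            accepted.extend(near)
        elif hits:
            accepted.append(max(hits, key=lambda c: c.score))
    return accepted
```

In this sketch an organism without any NrdD entry never receives an NrdG entry, which mirrors the requirement of "an existing NrdD entry" in the text.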
The alignment of candidates to known, experimentally annotated RNR sequences also provides an initial indication of the potential presence of self-splicing introns and inteins in the RNRs. Putative intein sequences within candidate RNR sequences are manually curated with the aid of the BLAST function of the InBase database (The Intein Database and Registry, http://www.neb.com/neb/inteins.html). Candidate self-splicing intron sequences are identified within RNR genes by manual secondary structure folding of the presumed intronic RNA according to the conventional folding suggested by Cech et al. [22].
Instead of using a release scheme for database content, the database is continuously updated with new sequences. In contrast, the database user interface is under a release scheme, and is currently at version 1.3. On each page, the date when data was last inserted or corrected is displayed together with the version number of the interface. At the time of writing (July 2009), the database contains over 2000 cellular organisms and viruses and over 9000 protein sequences (Table ). The main sequence data source is GenBank, but this is augmented at times with other data sources when high-quality sequences are available that have not yet been uploaded to GenBank. At the time of writing, we have downloaded and screened additional sequence data from the Joint Genome Institute (http://www.jgi.doe.gov/), the Broad Institute (http://www.broad.mit.edu/) and the University of Tokyo Cyanidioschyzon merolae Genome Project database (http://merolae.biol.s.u-tokyo.ac.jp/), in addition to data from GenBank.
Structures of representatives of all RNR proteins, except the class III activase NrdG and the regulatory protein NrdR, have been solved. RNRdb contains annotations and descriptions for all published RNR structures, together with links to the structure files in the Protein Data Bank (http://www.rcsb.org/pdb/).
Each RNRdb entry contains the full amino acid sequence and cross-references to the source databases for sequence and protein structure information, as well as the genomic location, when known. In addition to the classification of each sequence (by class and subunit), further attributes are listed, which enable retrieval of proteins with solved structures, experimentally derived mutational data (including the corresponding PubMed references), and intervening self-splicing sequences, i.e. group I and II introns and inteins; these are cross-referenced when applicable. The system for managing attributes is flexibly implemented, allowing new classification attributes to be added during curation.
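One common way to obtain this kind of flexibility in a relational database is to store attributes as rows rather than as columns, so that a new attribute requires no schema change. The following Python sketch uses an in-memory SQLite database to show the idea. The table names, column names and helper functions are hypothetical and are not taken from the RNRdb schema, which is not described here.

```python
import sqlite3

# Hypothetical schema: attributes are rows, so curators can add new ones
# without altering any table definition.
SCHEMA = """
CREATE TABLE sequence (
    id        INTEGER PRIMARY KEY,
    protein   TEXT NOT NULL,      -- e.g. NrdA, NrdD
    rnr_class TEXT NOT NULL       -- e.g. Ia, Ib, II, III
);
CREATE TABLE attribute (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE     -- e.g. 'solved structure', 'intein'
);
CREATE TABLE sequence_attribute (
    sequence_id  INTEGER NOT NULL REFERENCES sequence(id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(id),
    reference    TEXT,            -- optional literature reference
    PRIMARY KEY (sequence_id, attribute_id)
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def tag(conn, sequence_id, attribute_name, reference=None):
    """Attach an attribute to a sequence, creating the attribute if new."""
    conn.execute(
        "INSERT OR IGNORE INTO attribute (name) VALUES (?)",
        (attribute_name,),
    )
    (attribute_id,) = conn.execute(
        "SELECT id FROM attribute WHERE name = ?", (attribute_name,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO sequence_attribute VALUES (?, ?, ?)",
        (sequence_id, attribute_id, reference),
    )


def sequences_with(conn, attribute_name):
    """Return the ids of all sequences carrying the given attribute."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM sequence s
        JOIN sequence_attribute sa ON sa.sequence_id = s.id
        JOIN attribute a ON a.id = sa.attribute_id
        WHERE a.name = ?
        ORDER BY s.id
        """,
        (attribute_name,),
    )
    return [row[0] for row in rows]
```

With this layout, a call such as `tag(conn, 1, "intein")` both registers a previously unseen attribute and links it to a sequence, and `sequences_with(conn, "intein")` retrieves every sequence that carries it.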
Each sequence is linked to a source organism or virus record, which in turn is linked to its full NCBI taxonomy hierarchy, allowing filtering of sequences based on taxonomy (see below). Organisms and viruses with fully sequenced genomes are labelled, making it possible to establish whether, for any given organism, the list of annotated RNRs is based on complete or incomplete genome sequence data. RNRdb also contains information about genomes that lack RNRs (determined through candidate screening of complete genome sequences, as described above). As of July 2009 there are only five such cases among cellular organisms: three bacteria (Borrelia burgdorferi, Buchnera aphidicola str. Cc [25] and Ureaplasma urealyticum) and two eukaryotes (Entamoeba histolytica and Giardia lamblia). These are all parasites or obligate intracellular endosymbionts, and the absence of RNRs indicates that all must rely on salvage of host-derived deoxyribonucleotides.
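The reasoning in this paragraph, that absence of RNRs can only be concluded for completely sequenced genomes and that entries can be filtered by any level of the taxonomy, can be made concrete with a short Python sketch. The `Organism` record, the function names and the three status labels are hypothetical illustrations and do not reflect the actual RNRdb data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# An entry is an (organism name, protein name) pair, e.g. ("A", "NrdE").
Entry = Tuple[str, str]


@dataclass(frozen=True)
class Organism:
    name: str
    lineage: Tuple[str, ...]   # taxonomy path from the root to the organism
    complete_genome: bool      # True if the genome is fully sequenced


def filter_by_taxon(entries: List[Entry],
                    organisms: Dict[str, Organism],
                    taxon: str) -> List[Entry]:
    """Keep entries whose source organism has `taxon` in its lineage."""
    return [
        (org, protein)
        for org, protein in entries
        if taxon in organisms[org].lineage
    ]


def rnr_status(organism: Organism, entries: List[Entry]) -> str:
    """Classify an organism as 'present', 'absent' or 'undetermined'.

    Absence is only reported for complete genomes; with incomplete
    sequence data a missing RNR may simply not have been sequenced yet.
    """
    if any(org == organism.name for org, _ in entries):
        return "present"
    if organism.complete_genome:
        return "absent"
    return "undetermined"
```

Under these assumptions, an organism with a complete genome and no entries is reported as "absent", whereas one with an incomplete genome and no entries is only "undetermined".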