A major current task for bioinformatics is that of representing biological information into an electronic and computable form. These computational representations make the large amounts of data amenable to analysis, integration and eventual transformation into knowledge. The integration and new way of understanding the biology of a single cell such as Escherichia coli
K12 is a major challenge in genomics and should be a milestone for systems biology. RegulonDB is currently one of the largest database offering curated knowledge of the transcriptional regulatory network of any free-living organism (1
). Escherichia coli
K12 has about two-thirds of its total gene content experimentally characterized, whereas, for around a third, there is some information about their regulation. Our curation feeds both RegulonDB and EcoCyc (2
RegulonDB is constantly updated with information derived from original research papers. The second line in provides the link to the summary table of the total curated objects throughout the years. Our own efforts on high-throughput experimental mapping of promoters, whose initial results are reported here, enabled a significant increase in the total number of experimentally identified transcription start sites (TSSs).
Links to RegulonDB and related resources
RegulonDB contains detailed information of the different elements that conform the known regulatory network of the cell, such as transcription factors (TFs), small RNAs (sRNAs) and operon structures with their associated regulatory elements: promoters, TF binding sites and terminators, and from this version on, attenuators, riboswitches and sRNA targets. The description of the network is also conceptually enriched by more precise definitions—for instance, simple and complex regulons (3
) and regulator classes (global or local and internal or external sensing) as explained below. RegulonDB is complemented with computational analyses and genome-wide predictions of operons, promoters, TF binding sites, ribosome-binding sites and, for the first time, RNA regulatory target sites.
Visualizing tools in RegulonDB allow the user to navigate in the genome (Genome browser), to identify co-regulators for a particular TF, to locate the genes’ immediate neighbors in the regulatory network, and to identify sets of genes predicted to be functionally related (Nebulon tool). Moreover, it also incorporates tools for the analysis of the transcriptional regulation of global gene expression experiments made in E. coli K12 (GETtools), as well as for exhaustive analyses focused on the detection of regulatory signals in upstream regulatory regions (RSA tools).
This paper summarizes the modifications and improvements made during the last 2 years that are transforming RegulonDB into a more comprehensive computational model of regulation of gene expression in E. coli.
Enhanced and expanded description of regulatory elements of transcription initiation
RegulonDB is mainly a manual database of regulatory information in E. coli incorporated by a team of curators from the primary literature. PubMed abstracts are selected using a set of pertinent key words related to gene regulation. When there is direct or suspected new relevant information, the full text of the articles is analyzed and the data are added to RegulonDB and EcoCyc.
Starting on January 2008, every release of RegulonDB and EcoCyc will contain up-to-date curation with a delay of no more than 3 months. To achieve this, we have used three main curation strategies: by year, by regulon and sigmulon and by physiological system.
Classification of evidences (strong and weak)
The evidences associated to all RegulonDB objects are now classified as ‘strong’ or ‘weak,’ based on the confidence level of the experiment or prediction that supports objects and their relationships. A ‘strong’ evidence is assigned to an object when the experimental data provide high certainty of its existence; otherwise, it is a ‘weak’ evidence. Examples of strong evidences are DNA binding of purified TF for regulatory interactions, mapping of TSSs for promoters, and length of mRNA for transcription units. On the other hand, gene expression analyses and computational predictions are considered weak evidences. It is important to state that several weak evidences for an object do not become a strong one. These two types of evidences are distinguished graphically with solid or dashed lines for objects supported by strong or weak evidences, respectively.
Experimental TSS mapping
Experimental determination of TSSs provides primary information critical to identify promoters and regulatory regions that control gene expression. To expand beyond literature searches our knowledge of the regulatory universe of E. coli
, we started a genome-wide project to experimentally map as many promoters as possible in this organism. For this purpose, we used a modified 5′RACE protocol (4
) with gene-specific oligonucleotides. To validate the accuracy of this strategy, we determined the TSSs for 50 TUs, which have been previously published, 92% of these TUs showed a perfect match (with a discrepancy of up to one nucleotide with respect to the published TSS). The rest showed slight ambiguity inherent to the RACE protocol, of up to six nucleotides. We detected more than one TSS in 14 of these TUs. Interestingly, for only two of them, additional TSSs had been reported. Thus, our results are highly accurate and determine additional promoters to >25% of TUs previously determined.
A total of 317 TSSs for 269 TUs (38 have more than one TSS) had been mapped with the 5′RACE methodology. One hundred and ten of them correspond to TUs with hypothetical genes for which no function has been inferred. The newly mapped TSSs have been included in RegulonDB. A detailed compendium of these findings will be published elsewhere.
Computational prediction of promoters for alternative σ factors
We have generated computational predictions for four different promoters of the σ70
family: those of σ 24, 28, 32 and 38, in addition to the already existing σ70
promoters. Promoter predictions have also been generated for the σ54
factor, which defines a different σ factor family than σ70
. The putative +1 of transcription initiation along with the −35 and −10 boxes can be downloaded from RegulonDB (see ‘Predicted promoters’ on ). The method applied to generate the promoter predictions is described in (5
) (see ‘Promoter analysis tools’ in ).
Internal versus external sensing classes of transcription factors
The active and inactive conformation of TFs is regulated by specific cell signals (commonly called ‘effectors’) that can be metabolites, ions or other chemical-signaling molecules, through covalent or allosteric interactions. The origin of these effectors can be endogenous (synthesized inside the cell), exogenous (incorporated or transported from outside the cell) or both (hybrid). As proposed in (6
), TFs are classified according to the origin of their effectors as internal, external, hybrid or unknown. This feature has been added to the TFs in the database and a link to a specific web page that shows details of the cell sensing properties of the transcriptional regulators has been created ().
All genes that code for known and predicted TFs have been annotated with their corresponding gene ontology class (7
), and we uploaded those for the rest of the genes of the genome from EcoCyc.
RNA regulatory elements
Until recently, regulation beyond transcription initiation was included in external tables. RegulonDB v.6.0 has an expanded conceptual and relational model that includes other levels and mechanisms of regulation of gene expression, such as transcriptional elongation, post-transcriptional modification and translational initiation. The first elements that are now modeled and populated are RNA regulatory elements, specifically riboswitches and attenuators, and small RNAs. The user interface has a graphic representation and textual information about their sequences, location, evidences and references; an example is shown in and tables containing all these data are available in the web pages indicated in .
Graphic representation of the new objects in RegulonDB: 1. Attenuator, 2. Riboswitch, 3. sRNA.
Riboswitches and predicted attenuators
Riboswitches and attenuators are cis
-regulatory elements that can modulate transcription elongation or translation initiation. A riboswitch is part of the 5′ non-translated region in specific bacterial mRNAs that can modulate gene expression in direct response to small molecules without the need for a protein intermediate. These regulatory elements are highly conserved, both in structure and sequence, probably due to the constraints of forming a highly structured binding pocket for the effector. Riboswitches are usually found associated with transcription or translation attenuators (8
). Several of these riboswitches have been experimentally described and their sequences are obtained from Rfam, a database of RNA families (9
). In addition to all known riboswitches, we have also added to RegulonDB all the other cis
-regulatory RNA elements present in Rfam.
Attenuators are segments of RNA in the untranslated regions of some mRNAs that can form several mutually exclusive secondary structures, but, contrary to riboswitches, are rarely conserved at the sequence level. Under certain conditions, one of the structures will be the most stable, which will have a regulatory effect. Attenuators can act at the transcriptional level by causing the premature termination of transcription or at the initiation of translation by forming Shine-Dalgarno sequester structures (10–12
). A set of more than 700 predicted attenuators (both transcriptional and translational) was generated by Merino et al.
), taking into account the structural properties of known attenuators; these predictions are now included in RegulonDB.
The sRNAs genes code for RNA sequences of <350 nucleotides long can have intrinsic catalytic activity (e.g. the 10S catalytic subunit of the RNase P), modify a protein activity (e.g. csrB
/RNA, which binds to the CsrA translational regulator and thereby antagonizes its activity), or regulate the messenger stability or translational efficiency (e.g. micF
, which binds to ompF
mRNA to repress its translation) (14
). RegulonDB now includes 49 interactions between sRNAs and their target genes.
Overviews of gene regulation biology and additional computational improvements
So far, we have provided two major mechanisms to access the knowledge available in RegulonDB: through the navigation of individual objects and their associated links and through the download of flat files with complete lists of objects and their properties (e.g. regulatory interactions, predicted promoters, terminators, operons, etc.). Now, we offer two new accession mechanisms: one through the download of the complete database (both data and schema can be downloaded in dump files to populate the most common database management systems such as MySQL, Postgress, Oracle and Apache Derby); and a new integrative level of description of the biology of gene regulation, with tables and graphs providing a collection of perspectives of different relationships and their distributions. For instance, these new tables and graphs help to identify how many and which genes are transcribed by each of the seven σ factors, the distribution of genes in operons, and the distribution of activator- and repressor-binding site positions (see ‘Data Overviews’ on ).
RegulonDB v.6.0 has several computational and graphic user interface improvements: the graphic display of all the diagrams of genes and operons has been improved with a better quality and a better definition; the object names are completely visible inside the graph objects and mouse-over tooltips have been implemented on each object to simplify their identification by the user (e.g. binding site tooltips provide their central position); another improvement is the display of gene ontology. Automatic consistency checking for different objects has also been implemented to improve data integrity.
Implementation of the Textpresso text-mining engine
RegulonDB literature can now be searched with the Textpresso text-mining engine (15
), customized for E. coli
. Textpresso allows direct exploration of the curated literature, both at the level of highly specific key words and with entire categories or ontology classes (derived from GO concepts or customized word lists). The user can, for example, search for a type of regulation in which a gene or operon and a specific TF are mentioned within sentences of different papers. Currently, the tool can search through 2472 full-text papers, 3125 paper abstracts, and more than 4200 curator notes. The addition of this text-mining tool to RegulonDB will expand the possibilities, for the end user, of traversing the knowledge space of E. coli
metabolism and gene regulation and will allow our curators to refine and confirm their annotations. See also (16
New external links
In addition to the existing external database links (Swiss-Prot, GenBank, GenProtEC, OU MicroarrayDB), RegulonDB data is also accessible through EBI (17
). We are also coordinated with the EcoliHub team to link RegulonDB as part of their wiki and integrated database tools (http://www.ecolihub.org
Many functional insights can be gained by studying the context in which genes are conserved in different organisms. With this release of RegulonDB, we include a link to the Gene Context Tool (18
) for every protein in the database, allowing the user to visualize the genomic context among all bacterial-sequenced genomes. As an example of the use of this tool, general function can, in many cases, be inferred for genes with no annotation by observing the neighboring genes of their orthologs in other bacteria.