The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact firstname.lastname@example.org
Operon structures play an important role in co-regulation in prokaryotes. Although over 200 complete genome sequences are now available, databases providing genome-wide operon information have been limited to certain specific genomes. Thus, we have developed ODB (Operon DataBase), which provides a data retrieval system for known operons across many complete genomes. Additionally, putative operons that are conserved in terms of known operons are also provided. The current version of our database contains information on about 2000 known operons in more than 50 genomes and about 13000 putative operons in more than 200 genomes. This system integrates four types of associations: genome context, gene co-expression obtained from microarray data, functional links in biological pathways and the conservation of gene order across the genomes. These associations are indicators of the genes that organize an operon, and the combination of these indicators allows us to predict operons more reliably. Furthermore, our system validates these predictions using known operon information obtained from the literature. This database integrates known literature-based information and genomic data. In addition, it provides an operon prediction tool, which makes the system useful for both bioinformatics researchers and experimental biologists. Our database is accessible at http://odb.kuicr.kyoto-u.ac.jp/.
With the increasing availability of completely sequenced genomes, comparative genomic approaches are becoming more important for deciphering the functions of genes. Powerful methods that use the conservation of gene proximity on genomes (i.e. determining potential operons) can reveal functional associations between genes (1–3). Genes in an operon are functionally associated with each other in prokaryotes; thus, various kinds of operon prediction methods have been developed to understand the functional relationships and to annotate genes (4–13). Databases that accumulate experimentally verified operon information should be useful for validating such prediction methods and also for understanding the functional association between genes. However, databases providing genome-wide operon information have been limited to certain specific genomes (14,15). Although the STRING database was developed to identify functional associations between genes for multiple genomes, it relies on gene neighborhood inferred by genome context methods (16). Here, we introduce the database called ODB (Operon DataBase), which provides operon data documented in the literature and putative operons that are conserved in terms of known operons. Furthermore, to characterize operons, it integrates genome context, gene co-expression obtained from microarray data, functional links in biological pathways and data on the conservation of gene order across genomes. ODB also provides operon prediction based on these various types of data as an application of our database. These datasets are fully pre-computed so that all information can be quickly accessed. The ODB database integrates known literature-based information and genomic data. In addition, it provides an operon prediction tool, which makes the system useful for both bioinformatics researchers and experimental biologists.
We have collected information on known operons in multiple genomes from the literature. We note that the experimentally verified operons that we have collected were verified by a variety of means, from direct measurements such as primer extension and northern blots to less direct methods such as gene knock-out experiments. Our database represents an ongoing effort to increase the coverage of operons. The current version of our database contains information on about 2000 known operons in more than 50 genomes, obtained from a total of 825 publications (Table 1). Note that although some of these operons overlap, we use the term ‘operon’ to refer to an individual ‘transcriptional unit’, as opposed to the generally understood usage of the term, which may include multiple overlapping transcriptional units. These data also include the operons of Caenorhabditis elegans. Operon structures are most often observed in prokaryotes, but nematodes also have similar transcriptional systems (17,18). Thus, we added these eukaryotic operons to our database. Note that the operons from Bacillus subtilis include operons obtained from transcriptional maps stored in BSORF (http://bacillus.genome.jp/). Because these maps were derived from the results of northern blotting experiments, we added these operons to our database. These entries can be distinguished from the operons obtained from the literature, as the origin of the source (BSORF) is annotated in the database.
Table 1 also shows putative operons that are conserved in terms of known operons. To calculate this conservation, we used KEGG OC as the ortholog gene set (19); KEGG OC is an ortholog gene clustering based on Smith–Waterman sequence similarity scores. If genes in a known operon have ortholog genes in another genome and these ortholog genes are consecutively located on the same strand of the genome, we regarded them as a putative but highly reliable operon. Note that this rule is not applied to known mono-cistronic genes. Furthermore, the putative operons were also explored from the viewpoint of paralog genes. These putative operons are also explored in eukaryotes. Usually, the term ‘operon’ is not used for eukaryotic gene clusters, but we use this term operationally in our database. As a result, over 13000 putative operons were observed in over 200 genomes.
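The conservation rule described above can be illustrated with a minimal Python sketch. It is not the ODB implementation: the names `Gene` and `putative_operon`, the data layout, and the requirement that every gene of the known operon has an ortholog in the target genome are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple, Optional


class Gene(NamedTuple):
    gene_id: str
    index: int   # 0-based position in the target genome's gene order
    strand: str  # "+" or "-"


def putative_operon(
    known_operon: List[str],
    orthologs: Dict[str, str],
    target_genes: Dict[str, Gene],
) -> Optional[List[str]]:
    """Return ortholog IDs in genome order if they form one consecutive,
    same-strand run in the target genome; otherwise return None."""
    if len(known_operon) < 2:
        return None  # the rule is not applied to mono-cistronic genes
    hits = []
    for gene_id in known_operon:
        ortholog_id = orthologs.get(gene_id)
        if ortholog_id is None or ortholog_id not in target_genes:
            return None  # assumption: every gene needs an ortholog
        hits.append(target_genes[ortholog_id])
    if len({h.strand for h in hits}) != 1:
        return None  # orthologs must share one strand
    positions = sorted(h.index for h in hits)
    expected = list(range(positions[0], positions[0] + len(positions)))
    if positions != expected:
        return None  # orthologs must be consecutive, with no gaps
    return [h.gene_id for h in sorted(hits, key=lambda h: h.index)]
```

For example, a known operon `["a", "b"]` whose orthologs sit at positions 10 and 11 on the same strand would be returned as a putative operon, whereas orthologs on opposite strands would be rejected.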
ODB uses a relational database management system (MySQL, http://www.mysql.com/) to store and manage all information, including not only known and putative operons but also primary data, such as gene location and definition, and associations between genes. This system contains four types of associations between genes that determine an operon: (i) intergenic distances, (ii) functional links in biological pathways, (iii) gene co-expression obtained from microarray data and (iv) the conservation of gene order across multiple genomes. These four types of associations are considered indicators that the genes linked by them can organize an operon. Therefore, we pre-calculated these associations among all genes in all available genomes to characterize operons. Genes in an operon are often located closer together on the genome than adjacent genes that do not belong to the same operon. Therefore, intergenic distance is one of the indicators used to characterize operons. Intergenic distances are defined as the number of bases between the end position of a gene and the start position of the next gene on the genome.
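As a small illustration of the intergenic distance defined above, the following Python sketch computes the distance for each pair of neighboring genes. It assumes 1-based, inclusive start and end coordinates and a linear gene order; the function name and tuple layout are hypothetical. Under this convention, overlapping genes yield a negative distance.

```python
from typing import List, Tuple


def intergenic_distances(
    genes: List[Tuple[str, int, int]],
) -> List[Tuple[str, str, int]]:
    """genes holds (gene_id, start, end) with 1-based inclusive coordinates.
    Returns (upstream_id, downstream_id, bases_between) for each neighbor pair."""
    ordered = sorted(genes, key=lambda g: g[1])  # order genes by start position
    return [
        (a[0], b[0], b[1] - a[2] - 1)  # bases strictly between the two genes
        for a, b in zip(ordered, ordered[1:])
    ]
```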
In addition, genes in an operon are often functionally related. For example, genes appearing in a metabolic pathway are often clustered on the genome to be co-transcribed (20). Such functional links were obtained from KEGG pathway (19). We calculated the number of steps between genes in the pathway maps, where two genes that are linked across a compound are regarded as one step apart. In this way, we calculated the number of steps not only within the same pathway map but also across different pathway maps.
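One way to compute such step numbers is a breadth-first search over a graph in which two genes are connected when they are linked across a compound. The sketch below assumes this graph has already been built as an adjacency mapping (here called `links`, a hypothetical name) by merging all pathway maps; it is an illustration, not the procedure used to build ODB.

```python
from collections import deque
from typing import Dict, Optional, Set


def pathway_steps(
    links: Dict[str, Set[str]], source: str, target: str
) -> Optional[int]:
    """Shortest number of steps between two genes, or None if unconnected.
    links[g] is the set of genes linked to g across a shared compound."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        gene, dist = queue.popleft()
        for neighbor in links.get(gene, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

Because the adjacency mapping can combine links from several maps, the same search covers step numbers that cross pathway map boundaries.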
The KEGG EXPRESSION database contains gene expression data derived from microarrays of four organisms: B.subtilis, Escherichia coli K-12 W3110, Synechocystis sp. PCC6803 and Saccharomyces cerevisiae (19). We used the information on co-expressed genes from this database and calculated Pearson's correlation coefficients between the gene expression profiles obtained from these microarray data. Because microarray data are considered to reflect actual gene transcription and to be powerful tools for predicting operons, co-expressed gene clusters on the genome are possible operons. However, because of the limited range of experimental conditions and the quality of the experiments, certain operons are not transcribed and the level of gene co-expression is not homogeneous. Therefore, there are cases where genes are not co-expressed even though they belong to a known operon.
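For reference, Pearson's correlation coefficient between two expression profiles can be computed as in the following sketch. The profiles are assumed to be equal-length lists of expression values measured under the same conditions; the function name is illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)
```

A value near 1 indicates that the two genes rise and fall together across conditions, whereas a value near 0 indicates no linear relationship between the profiles.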
Gene order in an operon is often shuffled and collapsed in evolutionary history (21,22). Therefore, conservation of gene order across genomes is rather rare, especially in distantly related genomes. If such conservation is observed, the genes are probably related by a physical interaction, such as membership of a molecular complex (23). Therefore, this feature is also important in characterizing operons. We calculated the step number between gene pairs. That is, given a gene pair, we took the ortholog genes of each gene from all genomes, calling each such pair an ‘ortholog gene pair’. Then we calculated the step number between these two ortholog genes. When the gene pair is adjacently located on the genome, the step number is regarded as one. Here, we ignore the genomes included in the same taxonomic group, as defined in KEGG (http://www.genome.jp/kegg/catalog/org_list.html).
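The step number calculation can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the source: each query gene has at most one ortholog per genome, gene order is treated as linear rather than circular, and the "same taxonomic group" is interpreted as the group of the query genome. The names `Genome`, `ortholog_step_numbers` and `conserved_genome_count` are hypothetical.

```python
from typing import Dict, NamedTuple


class Genome(NamedTuple):
    group: str                 # taxonomic group label
    orthologs: Dict[str, str]  # query gene ID -> ortholog gene ID
    order: Dict[str, int]      # ortholog gene ID -> 0-based gene-order index


def ortholog_step_numbers(
    gene_a: str, gene_b: str, query_group: str, genomes: Dict[str, Genome]
) -> Dict[str, int]:
    """Step number of the ortholog gene pair in each genome outside the
    query's taxonomic group; adjacent orthologs have step number 1."""
    steps = {}
    for name, genome in genomes.items():
        if genome.group == query_group:
            continue  # skip genomes in the same taxonomic group
        ortholog_a = genome.orthologs.get(gene_a)
        ortholog_b = genome.orthologs.get(gene_b)
        if ortholog_a is None or ortholog_b is None:
            continue  # no ortholog gene pair in this genome
        steps[name] = abs(genome.order[ortholog_a] - genome.order[ortholog_b])
    return steps


def conserved_genome_count(steps: Dict[str, int], max_step: int) -> int:
    """Number of genomes whose ortholog gene pair lies within max_step."""
    return sum(1 for s in steps.values() if s <= max_step)
```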
All-against-all runs of these associations between genes were performed. Pre-computed results are stored in a table, allowing quick retrieval against the query specified by users. Each table in our system corresponds to a particular genome to facilitate efficient access and retrieval of the information.
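The idea of one pre-computed table per genome can be illustrated with the sketch below. It uses Python's built-in `sqlite3` module in place of MySQL so that it runs without a server, and the table name, column names and helper functions are hypothetical; the actual ODB schema is not described in the text.

```python
import sqlite3
from typing import List, Tuple


def _table_name(genome: str) -> str:
    # Table names cannot be bound as SQL parameters, so validate the code.
    if not genome.isidentifier():
        raise ValueError("invalid genome code")
    return f"assoc_{genome}"


def store_associations(conn: sqlite3.Connection, genome: str, rows: List[Tuple]) -> None:
    """Store pre-computed associations for one genome in its own table."""
    table = _table_name(genome)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "gene_a TEXT NOT NULL, gene_b TEXT NOT NULL, "
        "intergenic_distance INTEGER, pathway_steps INTEGER, "
        "coexpression REAL, ortholog_steps INTEGER, "
        "PRIMARY KEY (gene_a, gene_b))"
    )
    conn.executemany(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def lookup(conn: sqlite3.Connection, genome: str, gene: str) -> List[Tuple]:
    """Retrieve all pre-computed associations that involve the given gene."""
    table = _table_name(genome)
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE gene_a = ? OR gene_b = ? ORDER BY gene_a, gene_b",
        (gene, gene),
    )
    return cursor.fetchall()
```

Keeping each genome in a separate table means a query only touches the rows for the genome of interest, which is one plausible reason for the layout described above.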
When users search for a gene or an operon of interest, the gene cluster that includes it can be identified by its name or identifier. The user is then presented with a summary of the genes and the associations between genes in that region of the genome (Figure 1). Primary data such as gene names, gene IDs, definitions, KO IDs as functional classes, KEGG pathway IDs and EC numbers are presented. These are linked to the KEGG database if available. Additionally, a genomic view of the region of interest is presented. This view includes graphical symbols of operons, genes, pathways and EC numbers, and each symbol is also linked to the KEGG database. The user can also scroll and zoom the region of interest on the genome. Finally, the four types of associations are shown as separate tables. For the biological pathway table, the shortest step numbers between genes are presented. For the ortholog gene table, the shortest step numbers between the ortholog gene pairs are shown. From these tables, additional pages showing the details of the information are accessible. For the gene expression table, the correlation coefficients between gene expression profiles are shown, and the strength of co-expression is illustrated by a color gradient ranging from blue to red.
Because the conditions used to determine putative operons are very strict and are not genome-wide, ODB also provides a system to predict operons using the four associations. Given a specific species, predicted operons that may exist within that species are returned. Two options are available: simple and advanced prediction mode. In simple mode, users can obtain prediction results based on default parameter values that have been validated against known operons. In advanced prediction mode, users can freely change these parameter values, which are based on the four types of associations described above. When genes linked by these associations are clustered on the genome, they are likely to be an operon. Thus, we benchmarked the accuracy of the predictions based on combinations of various values of intergenic distances, step numbers between ortholog genes and the number of genomes having conserved ortholog genes that are linked within a specific range of step numbers. The optimal values, which predict the largest number of operons while keeping the accuracy high, are provided as default values in simple prediction mode (Supplementary Data). When there is little or no known operon information in a genome, the default values of another genome that is in the same taxonomic group and has sufficient operon information are used instead. If such genomes are also unavailable, we use the values of B.subtilis (see Supplementary Data for details).
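To make the role of these parameters concrete, the sketch below groups neighboring genes into predicted operons using two thresholds. The way the indicators are combined here (a pair is linked only if it is on the same strand, its intergenic distance is within `max_distance`, and its ortholog pair is conserved in at least `min_genomes` genomes) is an assumption for illustration; the actual combination rule and default values are given in the Supplementary Data and are not reproduced here. All names and example values are hypothetical.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    gene_a: str
    gene_b: str
    same_strand: bool
    intergenic_distance: int
    conserved_genomes: int  # genomes with the ortholog pair within the step range


def predict_operons(
    pairs: List[Pair], max_distance: int, min_genomes: int
) -> List[List[str]]:
    """pairs lists consecutive neighbors in genome order (g1-g2, g2-g3, ...).
    Chains of linked pairs are merged into one predicted operon."""
    operons: List[List[str]] = []
    current: List[str] = []
    for pair in pairs:
        linked = (
            pair.same_strand
            and pair.intergenic_distance <= max_distance
            and pair.conserved_genomes >= min_genomes
        )
        if linked:
            if not current:
                current = [pair.gene_a]  # start a new operon
            current.append(pair.gene_b)
        else:
            if current:
                operons.append(current)  # close the operon in progress
            current = []
    if current:
        operons.append(current)
    return operons
```

Raising `min_genomes` or lowering `max_distance` makes the prediction stricter, which mirrors the trade-off between the number of predicted operons and accuracy that the benchmark explores.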
ODB provides a platform for searching known operons and the putative operons derived from them, and for predicting operons with high accuracy validated by literature-based operon data. It includes about 2000 literature-based operons in over 50 genomes and about 13000 putative operons in over 200 genomes. In addition, the data provided from KEGG pathway and related resources allow analyses not only within a specific genomic context but also across genomes. Thus, it is the first database of its kind to integrate operon data from a variety of genomes, providing wide-ranging coverage of operons. This integrated system of both known literature-based and genomic data is useful for bioinformatics researchers and experimental biologists.
Supplementary Data are available at NAR Online.
We thank Kiyoko F. Aoki-Kinoshita for critical reading of our manuscript. This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology, the Japan Society for the Promotion of Science, and the Japan Science and Technology Agency. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University and the Super Computer System, Human Genome Center, Institute of Medical Science, the University of Tokyo. Funding to pay the Open Access publication charges for this article was provided by the grant-in-aid for scientific research from the Ministry of Education.
Conflict of interest statement. None declared.