EcoCyc (http://EcoCyc.org) provides a comprehensive encyclopedia of Escherichia coli biology. EcoCyc integrates information about the genome, genes and gene products; the metabolic network; and the regulatory network of E. coli. Recent EcoCyc developments include a new initiative to represent and curate all types of E. coli regulatory processes such as attenuation and regulation by small RNAs. EcoCyc has started to curate Gene Ontology (GO) terms for E. coli and has made a dataset of E. coli GO terms available through the GO Web site. The curation and visualization of electron transfer processes has been significantly improved. Other software and Web site enhancements include the addition of tracks to the EcoCyc genome browser, in particular a type of track designed for the display of ChIP-chip datasets, and the development of a comparative genome browser. A new Genome Omics Viewer enables users to paint omics datasets onto the full E. coli genome for analysis. A new advanced query page guides users in interactively constructing complex database queries against EcoCyc. A Macintosh version of EcoCyc is now available. A series of Webinars is available to instruct users in the use of EcoCyc.
Since the last NAR Database Issue publication on EcoCyc four years ago (1), significant additions and changes to the content and features of EcoCyc have occurred. EcoCyc staff perform an ongoing literature-based curation of the Escherichia coli genome, whose methodology and results were described in detail in 2007 (2). The EcoCyc curators edit gene names and functions, and write mini-reviews about each E. coli gene product and multimeric complex. These mini-reviews include extensive citations to the experimental literature. In mid-2006, EcoCyc reached an important milestone when EcoCyc curators had performed literature searches for every E. coli gene and had written mini-reviews for every gene for which experimental literature was found. In EcoCyc 12.5, released during fall 2008, 2650 (59.3%) E. coli genes have experimentally defined functions. Table 1 provides an overview of the current contents of EcoCyc.
Previously, curation of electron transfer reactions in EcoCyc was limited to brief written summaries of the gene products and protein complexes. This approach did not provide for a visual representation of the electron transfer enzymes in the membrane, nor did it indicate known or potential roles in cellular electron transfer and proton movement relative to the cell compartments. To address these issues, we have extended the Pathway Tools software that underlies EcoCyc in two respects: First, it can now visually depict electron transfer enzyme complexes and their associated balanced oxidation/reduction reactions (Figure 1). Reaction displays now show enzyme membrane localization, the flow of all substrates and products, and the fate of the protons associated with the overall reactions. Second, the software can now depict electron transfer pathways that consist of coupled systems of electron transfer enzymes (Figure 2).
E. coli possesses more than 25 enzymes and enzyme complexes that participate in the oxidation of primary electron donors or in the reduction of terminal electron acceptors during different cell culture conditions. The literature-based curation for approximately 15 electron transfer enzymes and enzyme complexes has been updated, and associated membrane depictions and balanced reactions are available. Electron transfer pathways have been generated and curated for 10 sets of electron donor/acceptor pairs.
An example of a membrane depiction is shown in Figure 1 for the E. coli enzyme NADH dehydrogenase I, encoded by the nuoABCDEFGHIJKLMN operon. Herein, the oxidation of NADH is shown to occur at the cytoplasmic face of the enzyme with electron transfer within the enzyme to the physiological electron acceptors, ubiquinone (UQ) or menaquinone (MQ).
Combining the oxidation reactions for a physiological electron donor and an acceptor yields an electron transport pathway. For example, in Figure 2 the NADH dehydrogenase I enzyme shown in Figure 1 is combined with cytochrome bo oxidase (cyoABCD) to represent the transfer of electrons from NADH to molecular oxygen (O2). Net movement of protons across the membrane by each enzyme complex provides, in part, the proton motive force (PMF) needed for ATP synthesis.
Curation of transcriptional regulation is performed by the RegulonDB group at the Center for Genomic Sciences, Universidad Nacional Autónoma de México. Curation of older literature on transcriptional regulation was completed in December 2006 and since then, data from new literature is consistently added to EcoCyc shortly after publication.
After reports of differences and apparent inconsistencies between the transcriptional regulatory networks of EcoCyc and RegulonDB appeared (3,4), we undertook detailed curation that led to fully synchronized content and releases in both databases (5). Other systematic curation efforts included the sigmulons of σ54 (RpoN), σ28 (FliA), σ19 (FecI), σ24 (RpoE), σ32 (RpoH), and σ38 (RpoS); various metabolic and motility regulons; and representations of the binding sites for the ArcA and NarL transcription factors. In addition, we have developed guidelines for transcription factor summaries to include relevant physiological data found in the literature that cannot be easily added as database objects. Many summaries have been updated according to these guidelines.
To facilitate the tracking and querying of data based on the quality of the evidence, we have classified the types of evidence used to annotate regulatory objects as ‘strong’ or ‘weak’. Strong evidence corresponds to experiments—irrespective of methodology—that provide direct physical evidence. Examples of strong evidence include the experimental mapping of transcription start sites and DNA binding of purified transcription factors. Evidence such as that from gene expression analyses that provide only indirect evidence is considered weak. Strong and weak evidence types are graphically distinguished by using solid or dashed lines for the corresponding objects (such as promoter arrows).
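The strong/weak classification described above can be sketched in a few lines of Python. This is only an illustration: the evidence labels, the `classify` and `line_style` helpers, and the `Strength` enum are hypothetical names and do not reflect the actual EcoCyc or RegulonDB evidence codes.

```python
from enum import Enum


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


# Hypothetical evidence labels, invented for this sketch.
# Experiments that give direct physical evidence count as strong.
DIRECT_PHYSICAL_EVIDENCE = {
    "transcription-start-site-mapping",
    "binding-of-purified-protein",
}


def classify(evidence_type: str) -> Strength:
    """Return STRONG for direct physical evidence, WEAK otherwise."""
    if evidence_type in DIRECT_PHYSICAL_EVIDENCE:
        return Strength.STRONG
    return Strength.WEAK


def line_style(strength: Strength) -> str:
    """Map evidence strength to the line style used when drawing an object."""
    return "solid" if strength is Strength.STRONG else "dashed"


# A promoter supported only by expression data would be drawn dashed.
promoter_style = line_style(classify("gene-expression-analysis"))
```

Treating everything outside the direct-evidence set as weak keeps the default conservative: a new or unrecognized evidence type is never promoted to strong by accident.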
To expand the information about transcription regulation of E. coli, the RegulonDB group has incorporated various new types of experimental and predicted data into EcoCyc. A collection of 259 new transcription start sites, which resulted from a high-throughput experimental modified RACE approach, was added (6). Promoters and DNA binding sites with evidence from at least two types of high-throughput data (such as computational predictions, microarrays and ChIP-chip experiments) have been added to EcoCyc. Examples include a collection of 54 σ32 promoters experimentally identified by ChIP-chip and by gene expression assays (7); 45 σ32 promoters identified by microarray analysis, transcription initiation mapping and computational analysis (8); and 45 Fur DNA binding sites identified by computational prediction and binding of purified protein (9).
EcoCyc has included information about the regulation of both transcription initiation and enzyme activity for many years. A major new EcoCyc initiative is to expand the database schema and content to include other types of regulation, such as attenuation and regulation of translation by small RNAs (sRNAs). For example, the EcoCyc schema can now represent all six known types of regulation by attenuation of transcription, each of which involves slightly different database fields to capture aspects such as the regulatory ligand, protein and RNA regions involved. This initiative will provide both more complete information about E. coli regulation and the regulatory datasets that can be used by bioinformaticians to develop predictors for a broader diversity of regulatory interactions from genome datasets.
All known examples of ribosome-mediated attenuation in the pathways of amino acid biosynthesis have been added to EcoCyc in release 12.5. For example, Figure 3 shows regulation of the thrLABC operon by attenuation, which is modulated by the availability of charged isoleucyl- and threonyl-tRNA. In this example of attenuation, translation of the thrL leader peptide open reading frame influences the formation of an attenuator structure. When charged isoleucyl- and threonyl-tRNAs are abundant, unobstructed translation by the ribosome enables the formation of a secondary structure that acts as a terminator, releasing RNA polymerase and halting transcription of the operon. On the EcoCyc display, the charged tRNAs are represented as rods. Their role in modulating termination at the attenuator is indicated by their red color and the ‘X’ near the terminator structure; this shows at a glance that a charged tRNA leads to premature termination. Curation of other attenuation systems is ongoing.
An example of the representation of regulation by sRNAs is shown in Figure 4. The transcription unit that encompasses the glmUS operon is shown. Expression of this operon is regulated at the level of transcription initiation by the transcription factor NagC (10), whose binding sites are shown as green boxes upstream of the glmUS transcription start site. In addition, the sRNA GlmZ was recently shown to regulate translation of the second open reading frame, glmS (11,12). glmS encodes L-glutamine:D-fructose-6-phosphate aminotransferase, the enzyme that catalyzes the first step in the biosynthesis of UDP-N-acetylglucosamine, which is used as the precursor for the synthesis of peptidoglycan, lipid A and the enterobacterial common antigen. Genetic experiments suggest that full-length GlmZ interacts directly with the 5′ UTR of glmS, unmasking the ribosome binding site and thus activating translation (11,12). The interaction of GlmZ with the glmUS mRNA is shown by a bar (representing GlmZ) that is connected with lines to glmUS, suggesting base-pairing at the position indicated.
The 12.5 release of EcoCyc contains 19 examples of attenuation and 15 examples of regulation by mechanisms other than transcription initiation, attenuation, or regulation of enzyme activity. We are actively expanding both the curation of the preceding regulatory mechanisms and the ability of the Pathway Tools software to handle additional regulatory mechanisms.
Gene Ontology (GO) is an accepted standard for ontological annotation of gene products (www.GeneOntology.org). The EcoCyc project has been annotating E. coli genes with GO terms for the past two years. Overall, the more than 38 000 GO terms present in EcoCyc have been derived from four sources: (i) GO terms were inferred from a mapping from the original MultiFun (13) ontology annotations within EcoCyc to GO terms; (ii) GO terms were inferred from a mapping from the Enzyme Commission (EC) numbers present within EcoCyc to GO terms; (iii) GO term assignments are manually curated by EcoCyc curators on an ongoing basis; and (iv) many GO terms were imported into EcoCyc from UniProt. EcoCyc and the EcoliWiki project (www.EcoliWiki.net) are jointly producing an official data file of E. coli GO terms that we regularly submit to the GO project, and that is available from the GO Web site at http://www.geneontology.org/GO.current.annotations.shtml.
GO terms are found on EcoCyc gene and gene product pages and provide a useful way of finding all E. coli genes with a common function. For example, rsmD encodes an rRNA methyltransferase and is annotated with the GO process term for rRNA methylation, GO:0031167. Clicking that GO term navigates the user to a page that both provides the definition of that GO term and lists all other gene products within EcoCyc that have been annotated with that GO term. The GO term annotations within EcoCyc should be considered incomplete, as manual curation of GO terms is ongoing.
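The lookup from a GO term to all gene products that carry it amounts to an inverted index over (gene, term) annotation pairs. The sketch below shows the idea on toy data. Only the rsmD / GO:0031167 pair comes from the text; the other gene names, the placeholder term and the `index_by_term` helper are hypothetical, and this is not how EcoCyc itself stores annotations.

```python
from collections import defaultdict

# Toy (gene, GO term) annotation pairs.
annotations = [
    ("rsmD", "GO:0031167"),   # from the text: rRNA methylation
    ("geneX", "GO:0031167"),  # hypothetical gene, for illustration only
    ("geneY", "GO:0000000"),  # hypothetical gene and placeholder term
]


def index_by_term(pairs):
    """Build a mapping from each GO term to the set of genes annotated with it."""
    index = defaultdict(set)
    for gene, term in pairs:
        index[term].add(gene)
    return index


go_index = index_by_term(annotations)

# All genes that share the rRNA methylation term, in alphabetical order.
genes_sharing_term = sorted(go_index["GO:0031167"])
```

Because the annotations are described as incomplete, the absence of a gene from such a list should not be read as evidence that the gene lacks the function.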
Although EcoCyc has now expanded far beyond its initial role, EcoCyc began as a database of E. coli metabolism, primarily describing metabolic enzymes and pathways. Therefore, annotations for many metabolic enzymes are among the oldest entries in EcoCyc. During the past decade, significant progress has been made in understanding E. coli metabolic pathways and their enzymes. Therefore, we have begun to systematically re-annotate these pathways; in release 12.5, 41 pathways that were entered into EcoCyc more than ten years ago, as well as 19 more recently added pathways, have been updated. As part of this effort, the curation of more than 180 metabolic enzymes has already been updated to reflect the latest state of knowledge.
The EcoCyc genome browser now supports a track mechanism to aid users in visually analyzing positional datasets with respect to genome features such as the positions of genes, promoters and transcription factor binding sites. Examples include datasets of predicted promoters, predicted transcription factor binding sites and ChIP-chip datasets. Datasets encoded as GFF-format files (http://www.sanger.ac.uk/Software/formats/GFF/) can be loaded into the desktop or Web versions of EcoCyc. Figure 5 shows a type of track specifically designed for the visualization of ChIP-chip data called a graph track.
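As a rough illustration of preparing a positional dataset for such a track, the following Python sketch formats scored intervals as tab-separated GFF lines (sequence name, source, feature, start, end, score, strand, frame, and an optional attributes field). The probe coordinates, scores, and the `chromosome`, `example_chip` and `probe` labels are made up. Which sequence names and feature types the EcoCyc browser expects is not stated here, so those values are assumptions.

```python
def gff_line(seqname, source, feature, start, end, score,
             strand=".", frame=".", attributes=""):
    """Format one GFF record as a tab-separated line (no trailing newline)."""
    fields = [
        seqname, source, feature,
        str(start), str(end),
        f"{score:.2f}",   # score column, e.g. a ChIP-chip signal value
        strand, frame,
    ]
    if attributes:        # the attributes column is optional
        fields.append(attributes)
    return "\t".join(fields)


# Hypothetical probe intervals: (start, end, signal).
probes = [(1000, 1059, 0.42), (1100, 1159, 2.87)]

lines = [gff_line("chromosome", "example_chip", "probe", s, e, sc)
         for s, e, sc in probes]
gff_text = "\n".join(lines) + "\n"

# To save the track for loading into a browser:
# with open("chip_track.gff", "w") as handle:
#     handle.write(gff_text)
```

The strand and frame columns are left as `.` because a ChIP-chip signal is not strand-specific and has no reading frame.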
Users of EcoCyc include both researchers who study the biology of E. coli and those who use E. coli, and thus EcoCyc, as a reference for their research in other organisms. To support both types of users, we have added several comparative tools to EcoCyc. The comparative genome browser is accessible from every gene page, and allows a user to select organisms from the hundreds that are available via the BioCyc database collection at BioCyc.org (14) and to then examine the ortholog of the starting gene in its local context within each selected organism. For example, Figure 6 shows the E. coli gene thrA aligned with its orthologs in several other organisms. The starting gene is marked with hash marks and aligned across the displays. Note that the other orthologs present are marked with the same color. For example, the adjacent gene thrB has an ortholog present in each organism displayed. The tool also indicates at the bottom of the page when no ortholog could be found. Using the multi-genome browser, users can query a broad range of organisms in search of orthologs and then can see the extent to which those orthologs have maintained their genetic context relative to E. coli.
Many users come to EcoCyc with large-scale datasets that include gene expression, proteomic and metabolomic data. As described in our earlier report on the EcoCyc database, these datasets can be viewed in the context of the E. coli metabolic network via the Cellular Omics Viewer, which is a tool that enables users to ‘paint’ the results from these datasets onto the Cellular Overview diagram. To this tool, we have recently added the Genome Omics Viewer. This new viewing tool enables the display of large-scale gene-related datasets on the full E. coli genome, providing a valuable additional tool for the interpretation of high-throughput data. As shown in Figure 7, the Genome Omics Viewer differs from the EcoCyc Genome Browser both by providing a schematic rather than a ‘to-scale’ view of the genome and by placing an emphasis on operon membership and adjacent genes. In combination, the Genome and Cellular Omics Viewers enable interpretation of large datasets in both the metabolic and genomic contexts.
The new EcoCyc advanced query page is accessible by clicking the ‘Advanced Query Form’ button located at the bottom of most EcoCyc data pages. The resulting page enables users to interactively construct complicated, multi-criteria searches against EcoCyc. Example queries include ‘Find all proteins of E. coli K-12 for which the DNA-FOOTPRINT-SIZE is smaller than 10’ and ‘Find all proteins of E. coli K-12 containing more than one subunit and that catalyze a reaction in which pyruvate is a substrate’. Instructions for the advanced query page are available at http://www.biocyc.org/webQueryDoc.html.
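Conceptually, a multi-criteria search of this kind is a conjunction of predicates applied to each database object. The Python sketch below mirrors the second example query on invented data. The `Protein` record, its fields, the enzyme names and the `query` helper are all hypothetical and unrelated to the actual EcoCyc schema or query language.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    name: str
    subunits: int
    substrates: set = field(default_factory=set)


# Invented records, for illustration only.
proteins = [
    Protein("enzyme-A", 4, {"pyruvate", "NAD+"}),
    Protein("enzyme-B", 1, {"pyruvate"}),
    Protein("enzyme-C", 2, {"glucose"}),
]


def query(records, *criteria):
    """Return the records that satisfy every criterion (logical AND)."""
    return [r for r in records if all(test(r) for test in criteria)]


# "More than one subunit" AND "pyruvate is a substrate".
hits = query(
    proteins,
    lambda p: p.subunits > 1,
    lambda p: "pyruvate" in p.substrates,
)
hit_names = [p.name for p in hits]
```

Keeping each criterion as a separate predicate means a search can be narrowed by adding one more predicate, with no need to rewrite the query as a whole.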
For many years we have provided a version of EcoCyc that runs as an application on a user's local laptop or workstation computer. This form of EcoCyc access is highly recommended for frequent EcoCyc users because it provides faster execution and more capabilities than the Web version of EcoCyc. Scientists who use either the omics data analysis facilities or the genome browser tracks will find this version's faster speeds particularly useful. Differences between the desktop and Web versions of EcoCyc are summarized at http://www.biocyc.org/desktop-vs-web-mode.shtml.
In early 2008, we adapted the desktop EcoCyc software to run on the Macintosh, adding one more personal computer option to the existing PC/Windows and PC/Linux platforms.
The EcoCyc Web site now allows users to create accounts through which they can customize the appearance of EcoCyc pages, store organism sets for comparative operations, configure default settings for the Omics Viewers, and register to receive periodic email updates about EcoCyc. See the ‘Create New Account’ link in the upper right corner of most EcoCyc Web pages.
We have produced several video tutorials that walk users through the basic and advanced use of the EcoCyc and BioCyc Web sites, and that cover the unique features of the desktop software. These videos are available in several formats directly from the BioCyc site (http://www.biocyc.org/webinar.shtml), and as podcasts via either iTunes (search for ‘BioCyc’ in the podcasts section of the iTunes Store) or the video-sharing site YouTube (http://www.youtube.com/user/SRIBRG).
Flat files that contain the EcoCyc data are freely available for download at http://www.biocyc.org/download.shtml. The Pathway Tools software/database bundle is freely available to academic researchers.
This work was supported by the National Institutes of Health (grants GM077678 and GM75742 to P.D.K., GM071962 to J.C.-V.). Funding for open access charge: NIH grant GM077678.
Conflict of interest statement. SRI authors benefit from a commercial licensing program for Pathway Tools.
We thank Dr Robert Landick for suggesting the graph-track display.