Nucleic Acids Res. 2013 January; 41(Database issue): D787–D792.
Published online 2012 November 21. doi:  10.1093/nar/gks1108
PMCID: PMC3531061

WholeCellKB: model organism databases for comprehensive whole-cell models

Abstract

Whole-cell models promise to greatly facilitate the analysis of complex biological behaviors. Whole-cell model development requires comprehensive model organism databases. WholeCellKB (http://wholecellkb.stanford.edu) is an open-source web-based software program for constructing model organism databases. WholeCellKB provides an extensive, fully customizable data model that describes individual species in detail, including the structure and function of each gene, protein, reaction and pathway. We used WholeCellKB to create WholeCellKB-MG, a comprehensive database of the Gram-positive bacterium Mycoplasma genitalium using over 900 sources. WholeCellKB-MG is extensively cross-referenced to existing resources including BioCyc, KEGG and UniProt. WholeCellKB-MG is freely accessible through a web-based user interface as well as through a RESTful web service.

INTRODUCTION

A primary challenge in computational biology is to predict how complex phenotypes such as growth and replication arise from networks of individual molecules. Whole-cell models promise to tackle this challenge by integrating heterogeneous molecular data into predictive computational models. This integration requires model organism databases that provide comprehensive, readily computable molecular data.

WholeCellKB is an open-source, web-based software program for developing comprehensive model organism databases for whole-cell models. As illustrated in Figure 1, WholeCellKB enables whole-cell modeling by organizing diverse molecular data from primary research articles, reviews, books and databases into a single database. The WholeCellKB data model supports detailed descriptions of individual species including their genes, operons, proteins, macromolecular complexes, molecular interactions, chemical reactions and pathways. Importantly, WholeCellKB also facilitates extensive source documentation. We used WholeCellKB to develop WholeCellKB-MG, an extensive database of the pathogenic Gram-positive bacterium Mycoplasma genitalium.

Figure 1.
WholeCellKB-MG enables whole-cell modeling by integrating diverse data sources into a single database. (a) Currently, WholeCellKB-MG integrates >900 primary research articles, reviews, books and databases. (b) WholeCellKB-MG comprehensively represents ...

Here, we describe WholeCellKB-MG’s content, curation, user interface and implementation. We also compare WholeCellKB-MG to existing resources, highlighting WholeCellKB-MG’s greater scope and granularity. Finally, we discuss our future plans for WholeCellKB.

CONTENT

Our goal was to create a database comprehensive enough to enable a whole-cell model (1). As illustrated in Figure 2, WholeCellKB-MG broadly represents M. genitalium molecular biology including (i) its subcellular organization; (ii) its chromosome sequence; (iii) the location, length, direction and essentiality of each gene; (iv) the organization and promoter of each transcription unit; (v) the expression and degradation rate of each RNA transcript; (vi) the specific folding and maturation pathway of each RNA and protein species including the localization, N-terminal cleavage, signal sequence, prosthetic groups, disulfide bonds and chaperone interactions of each protein species; (vii) the subunit composition of each macromolecular complex; (viii) its genetic code; (ix) the binding sites and footprint of every DNA-binding protein; (x) the structure, charge and hydrophobicity of every metabolite; (xi) the stoichiometry, catalysis, coenzymes, energetics and kinetics of every chemical reaction; (xii) the regulatory role of each transcription factor; (xiii) its chemical composition and (xiv) the composition of its laboratory growth medium. Table 1 summarizes WholeCellKB-MG’s size and content.

Figure 2.
WholeCellKB aims to comprehensively describe cell physiology including the structure and dynamics of every metabolite, gene, RNA transcript and protein. Boxes illustrate several molecular properties represented by WholeCellKB.
Table 1.
WholeCellKB-MG size

CURATION

We curated WholeCellKB-MG in five steps based on >900 primary research articles, reviews, books and databases. First, we curated the overall structure of M. genitalium including its size, shape, subcellular organization and chemical composition based on several experimental studies including Morowitz et al. (2). We also assembled the chemical composition of Mycoplasma laboratory growth medium based on analyses reported by Solabia (3).

Second, we curated the structure of the M. genitalium chromosome including its sequence, the location, length and direction of each gene and its transcription unit organization based on the Comprehensive Microbial Resource (CMR) annotation (4) and a recent study by Güell et al. (5). We reconstructed the location of each promoter and the expression, degradation rate and essentiality of each gene product from four recent studies (6–9). We catalogued DNA-binding sites and transcriptional regulatory interactions from several sources including DBTBS (10).

Third, we assembled the structure of each RNA and protein gene product. We compiled the post-transcriptional processing and modification of each RNA transcript from several sources including Peil (11). We reconstructed the signal sequence, localization, chaperone-mediated folding, post-translational modification, disulfide bonds, subunit composition and DNA footprint of each protein and macromolecular complex from a large number of primary research articles, computational models and databases. We assembled the chemical regulation of each gene product from several sources including DrugBank (12). We used ExPASy ProtParam (13) to calculate the pI, extinction coefficient, half-life, instability index, aliphatic index and grand average of hydropathy of every protein species.
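
Two of the ProtParam quantities listed above have simple closed forms and can be sketched directly. The Python sketch below is not ProtParam's code. It assumes the Kyte-Doolittle hydropathy scale for the grand average of hydropathy (GRAVY) and Ikai's formula for the aliphatic index, and the function names are illustrative only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu.

    Assumed formula: X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)).
    """
    length = len(sequence)

    def mole_percent(aa):
        return 100.0 * sequence.count(aa) / length

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
```

Both functions expect a non-empty sequence of one-letter codes for the standard amino acids; any other character raises a KeyError in gravy and is ignored by aliphatic_index.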

Fourth, we curated the specific chemical reactions catalyzed by each gene product starting from the CMR (4), GenBank (14), KEGG (15) and UniProt (16) genome annotations and the reconstructed RNA and protein maturation pathways. To maximize the scope of the database and to fill gaps in the genome annotation, we expanded each gene product’s annotation based on primary research articles we identified by searching PubMed (17) and Google Scholar (http://scholar.google.com). We consulted BioCyc (18), KEGG (15), two flux-balance analysis (FBA) models of bacterial metabolism (19,20) and hundreds of additional primary research articles to curate the stoichiometry of each chemical reaction. We assembled the thermodynamics and kinetics of each chemical reaction from several databases including BRENDA (21), SABIO-RK (22) and UniProt (16) and an FBA model (20).

Finally, we compiled the M. genitalium metabolome. We included all metabolites involved in the reconstructed reactions, biomass or growth medium. We curated the empirical formula, structure, charge and intracellular concentration of each metabolite from several databases including BioCyc (18), CyberCell (23) and PubChem (24) and a comprehensive mass-spectrometry study (25). We used ChemAxon Marvin (http://www.chemaxon.com/products/marvin) to calculate the molecular weight, van der Waals volume, pI, logD and logP of each metabolite.

In order to create a comprehensive description of M. genitalium physiology, we based WholeCellKB-MG on studies of closely related organisms where studies of M. genitalium were unavailable. In cases where multiple observations were available, we based the reconstruction on the most closely related organism. We used bi-directional best BLAST (26) to identify homologous genes. To provide model transparency, we tracked the species, experimental conditions and citation of each piece of evidence.
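
The bi-directional best BLAST criterion can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for WholeCellKB-MG: it assumes the BLAST results have already been parsed into (query, subject, bit score) tuples, and all function names and gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    On equal scores the first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Made-up identifiers and scores, for illustration only.
genome_a_vs_b = [
    ("a1", "b1", 200.0),
    ("a1", "b2", 150.0),
    ("a2", "b2", 90.0),
    ("a3", "b1", 50.0),
]
genome_b_vs_a = [
    ("b1", "a1", 198.0),
    ("b2", "a1", 140.0),
    ("b2", "a2", 95.0),
]
homologs = reciprocal_best_hits(genome_a_vs_b, genome_b_vs_a)
```

In this toy input only a1 and b1 are each other's best hit, so a2 and a3 are left without an assigned homolog.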

COMPARISON TO EXISTING RESOURCES

WholeCellKB represents the specific molecular interactions of individual species similar to previous databases such as BioCyc (18,27) and BiGG (28). In particular, WholeCellKB’s data model, user interface and species-specific content were heavily inspired by BioCyc.

Importantly, WholeCellKB-MG also has several major differences from existing resources. First, WholeCellKB-MG more broadly represents cell physiology. WholeCellKB-MG represents the molecular details of 28 cellular processes including well-studied processes such as metabolism as well as less well-understood processes such as DNA damage and repair and RNA and protein degradation. The online documentation at http://wholecellkb.stanford.edu/about provides further information about the WholeCellKB-MG data model and how WholeCellKB-MG represents each cellular process. Figure 3 compares WholeCellKB-MG’s content to that of several existing databases.

Figure 3.
Detailed comparison of the content of WholeCellKB-MG and several existing biological databases. In addition to containing detailed descriptions of genetics, metabolism and transcriptional regulation comparable to existing resources such as BiGG (28), ...

Second, whole-cell modeling requires model organism databases which explicitly define the participants of each molecular interaction and chemical reaction. WholeCellKB-MG addresses this need by representing the specific molecules involved in every molecular interaction and by requiring structures for each molecule. For example, WholeCellKB-MG represents the specific RNA bases involved in every RNA methylation reaction, whereas existing resources lump RNA methylation interactions into a single generic reaction. WholeCellKB-MG represents every major cellular process including RNA processing and protein processing, modification and translocation with similarly fine molecular resolution.

Third, where available, WholeCellKB-MG contains not only structural but also quantitative functional descriptions of each molecule and molecular interaction. For example, WholeCellKB-MG contains chemical reaction rate laws and kinetic parameters, RNA transcript expression levels and half-lives, and cellular and growth medium chemical compositions. In total, WholeCellKB-MG represents 1836 heterogeneous model parameters. Table 2 summarizes how WholeCellKB represents these heterogeneous parameters using several types of database entries.

Table 2.
WholeCellKB-MG parameters

DATA INPUT

WholeCellKB provides administrators with two editing interfaces: (i) a web form to edit single entries and (ii) an Excel-based interface to simultaneously edit multiple entries. We believe that these two interfaces enable collaborative model organism database development.

At the beginning of our M. genitalium curation efforts, we primarily used the batch interface to quickly import large amounts of data from other genome annotations. We continued to use the batch interface throughout the project to import high-throughput molecular data. Later in our M. genitalium curation efforts, we primarily used the form interface to refine our annotation based on specific biochemical studies. Overall, we found that WholeCellKB improved the quality of our annotation and in particular encouraged us to thoroughly annotate the original source of each datum.

Data submitted to WholeCellKB was extensively validated to ensure consistency and correctness. For example, WholeCellKB checked that each chemical formula was valid, that each reaction was mass-balanced and that every molecule and kinetic parameter was defined in each reaction rate law. WholeCellKB provided hints on how to correct invalid data such as the atom imbalance of invalid reactions.
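
The three checks named above can be illustrated with a short Python sketch. It is not WholeCellKB's validation code. It assumes flat empirical formulas without parentheses or charges, stoichiometry given as signed coefficients, and rate laws written as Python-style arithmetic expressions; all function and variable names are hypothetical.

```python
import ast
import re
from collections import Counter

# One element symbol followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return element counts for a flat formula such as 'C6H12O6'.

    Raises ValueError if anything other than symbol/count tokens remains.
    Symbols are not checked against the periodic table.
    """
    if not formula or _TOKEN.sub("", formula):
        raise ValueError(f"invalid formula: {formula!r}")
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def reaction_imbalance(stoichiometry, formulas):
    """Net atoms per element; an empty dict means mass-balanced.

    stoichiometry maps metabolite id to coefficient (negative = consumed).
    """
    net = Counter()
    for metabolite, coefficient in stoichiometry.items():
        for element, count in parse_formula(formulas[metabolite]).items():
            net[element] += coefficient * count
    return {element: n for element, n in net.items() if n != 0}


def undefined_symbols(rate_law, defined):
    """Names used in a rate-law expression that are missing from `defined`."""
    tree = ast.parse(rate_law, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return used - set(defined)


# Neutral-form formulas for a hexokinase-like reaction (illustrative).
formulas = {
    "glc": "C6H12O6",
    "atp": "C10H16N5O13P3",
    "g6p": "C6H13O9P",
    "adp": "C10H15N5O10P2",
}
balanced = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1}
missing_adp = {"glc": -1, "atp": -1, "g6p": 1}
```

The dictionary returned by reaction_imbalance doubles as a correction hint: for missing_adp it reports exactly the atoms of the omitted ADP, with negative signs because they are absent from the product side.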

DATA ACCESS

WholeCellKB-MG is freely accessible through a simple and intuitive web-based interface at http://wholecellkb.stanford.edu. This web-based interface allows users to quickly browse, search and export the database. It also allows administrators to add, edit and delete entries. Importantly, the interface is extensively commented and hyperlinked, allowing users to easily find the primary source of each datum.

WholeCellKB-MG is also accessible through a RESTful interface. This interface provides the content of every HTML page in JSON and XML formats. We are currently using this interface to develop software for visualizing whole-cell simulations.

DEVELOPER API

WholeCellKB was designed to enable modelers to develop model organism databases for whole-cell models, including designing custom data models and user interfaces. WholeCellKB provides a framework for viewing, searching, exporting and editing database entries which developers can combine with custom data models and HTML templates. This allows developers to build custom model organism databases with minimal effort and without any knowledge of database design. Furthermore, because WholeCellKB is open source and implemented with Python, modelers can easily display scientific calculations alongside curated data in the user interface. The online documentation provides further instructions on how to customize WholeCellKB.

IMPLEMENTATION

WholeCellKB was implemented in Python using the Django (http://www.djangoproject.com) web framework and stored using the relational database MySQL (http://www.mysql.com). Full-text search was implemented using Haystack (http://haystacksearch.org) and Xapian (http://xapian.org). Excel, JSON and XML export were implemented using OpenPyXL (http://bitbucket.org/ericgazoni/openpyxl), simplejson (http://pypi.python.org/pypi/simplejson) and xml.dom (http://docs.python.org/library/xml.dom.html). WholeCellKB runs on the Apache (http://www.apache.org) web server using the mod_wsgi (http://code.google.com/p/modwsgi) module. All of the software used to implement WholeCellKB is open source.
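
As a rough illustration of the JSON and XML export step, the following Python sketch serializes a single flat database entry. It is not WholeCellKB's export code: it uses the standard-library json module in place of simplejson, together with xml.dom.minidom, and the entry fields and function names are hypothetical.

```python
import json
from xml.dom.minidom import Document


def to_json(entry):
    """Serialize a flat entry (dict of simple values) to a JSON string."""
    return json.dumps(entry, sort_keys=True, indent=2)


def to_xml(entry, root_tag="entry"):
    """Serialize a flat entry to XML, one child element per field.

    Field names must be valid XML element names; values are written as
    text nodes, so special characters are escaped on output.
    """
    doc = Document()
    root = doc.createElement(root_tag)
    doc.appendChild(root)
    for key in sorted(entry):
        child = doc.createElement(key)
        child.appendChild(doc.createTextNode(str(entry[key])))
        root.appendChild(child)
    return doc.toxml()


# Hypothetical entry; the field names are not taken from WholeCellKB.
example_entry = {"id": "example_gene", "name": "Example gene", "type": "Gene"}
json_text = to_json(example_entry)
xml_text = to_xml(example_entry)
```

Nested fields such as lists of cross-references would need a recursive variant of to_xml; the flat case is enough to show how one entry can be rendered in both formats from the same dictionary.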

SUMMARY AND FUTURE DIRECTIONS

WholeCellKB-MG is an extensive database of M. genitalium designed to facilitate whole-cell modeling. Currently, we are continuing to curate the database as well as starting to create equally comprehensive databases of other model microorganisms. Beyond facilitating realistic whole-cell models, we believe that these databases are useful platforms for experimental and computational biologists.

We created WholeCellKB-MG using WholeCellKB, an open-source, web-based software program which enables modelers to quickly develop model organism databases for whole-cell modeling.

Beyond continuing to curate model organisms, we also plan to continue to strengthen the WholeCellKB software. We plan to add tools for importing databases curated with other software such as Pathway Tools (27), for storing the detailed history of each database entry and for comparing model organism databases. We also plan to expand the search functionality of the RESTful API. As the whole-cell modeling community grows, we plan to enable open editing similar to Wikipedia. Finally, we are currently using WholeCellKB’s RESTful API to develop tools for visualizing whole-cell simulations.

We hope that other researchers will use WholeCellKB to develop model organism databases and whole-cell models. We believe that WholeCellKB will not only speed up database curation and whole-cell model development but also encourage best annotation practices. Ultimately, we hope that WholeCellKB in combination with whole-cell models will accelerate biological discovery and bioengineering.

FUNDING

NIH Director’s Pioneer Award [5DP1LM01150-05] and a Hellman Faculty Scholarship (to M.W.C.); NDSEG, NSF and Stanford Graduate Fellowships (to J.R.K.); NSF and Bio-X Graduate Student Fellowships (to J.C.S.) and a Stanford Graduate Fellowship (to D.N.M.). Funding for open access charge: NIH Director’s Pioneer Award [5DP1LM01150-05].

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Elsa Birch, Nick Ruggero and Ruby Lee for enlightening discussions on database design, curation, modeling and visualization.

REFERENCES

1. Karr JR, Sanghvi JC, Macklin DN, Jacobs JM, Gutschow MV, Bolival B, Assad-Garcia N, Glass JI, Covert MW. A whole-cell computational model predicts phenotype from genotype. Cell. 2012;150:389–401.
2. Morowitz HJ, Tourtellotte ME, Guild WR, Castro E, Woese C. The chemical composition and submicroscopic morphology of Mycoplasma gallisepticum, Avian PPLO 5969. J. Mol. Biol. 1962;4:93–103.
3. Solabia. Biotechnology Products. 2011. Retrieved from http://www.solabia.com/ (14 March 2011, date last accessed)
4. Davidsen T, Beck E, Ganapathy A, Montgomery R, Zafar N, Yang Q, Madupu R, Goetz P, Galinsky K, White O, et al. The comprehensive microbial resource. Nucleic Acids Res. 2010;38:D340–D345.
5. Güell M, van Noort V, Yus E, Chen WH, Leigh-Bell J, Michalodimitrakis K, Yamada T, Arumugam M, Doerks T, Kühner S, et al. Transcriptome complexity in a genome-reduced bacterium. Science. 2009;326:1268–1271.
6. Weiner J, 3rd, Herrmann R, Browning GF. Transcription in Mycoplasma pneumoniae. Nucleic Acids Res. 2000;28:4488–4496.
7. Weiner J, 3rd, Zimmerman CU, Göhlmann HW, Herrmann R. Transcription profiles of the bacterium Mycoplasma pneumoniae grown at different temperatures. Nucleic Acids Res. 2003;31:6306–6320.
8. Bernstein JA, Khodursky AB, Lin PH, Lin-Chao S, Cohen SN. Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc. Natl Acad. Sci. USA. 2002;99:9697–9702.
9. Glass JI, Assad-Garcia N, Alperovich N, Yooseph S, Lewis MR, Maruf M, Hutchison CA, 3rd, Smith HO, Venter JC. Essential genes of a minimal bacterium. Proc. Natl Acad. Sci. USA. 2006;103:425–430.
10. Sierro N, Makita Y, de Hoon M, Nakai K. DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 2008;36:D93–D96.
11. Peil L. 2009. Ribosome assembly factors in Escherichia coli. Master Thesis. Tartu University.
12. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et al. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. 2011;39:D1035–D1041.
13. Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. Protein identification and analysis tools on the ExPASy server. In: Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A, editors. The Proteomics Protocols Handbook. Totowa, NJ: Humana Press; 2005. pp. 571–607.
14. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2011;39:D32–D37.
15. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res. 2012;40:D109–D114.
16. The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40:D71–D75.
17. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38:D5–D16.
18. Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, et al. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res. 2011;39:D583–D590.
19. Suthers PF, Dasika MS, Kumar VS, Denisov G, Glass JI, Maranas CD. A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput. Biol. 2009;5:e1000285.
20. Feist AM, Henry CS, Reed JL, Krummenacker M, Joyce AR, Karp PD, Broadbelt LJ, Hatzimanikatis V, Palsson BØ. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol. Syst. Biol. 2007;3:121.
21. Scheer M, Grote A, Chang A, Schomburg I, Munaretto C, Rother M, Söhngen C, Stelzer M, Thiele J, Schomburg D. BRENDA, the enzyme information system in 2011. Nucleic Acids Res. 2011;39:D670–D676.
22. Wittig U, Kania R, Golebiewski M, Rey M, Shi L, Jong L, Algaa E, Weidemann A, Sauer-Danzwith H, Mir S, et al. SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res. 2012;40:D790–D796.
23. Sundararaj S, Guo A, Habibi-Nazhad B, Rouani M, Stothard P, Ellison M, Wishart DS. The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli. Nucleic Acids Res. 2004;32:D293–D295.
24. Bolton E, Wang Y, Thiessen PA, Bryant SH. PubChem: integrated platform of small molecules and biological activities. In: Bolton E, Wang Y, Thiessen PA, Bryant SH, editors. Annual Reports in Computational Chemistry. Washington, DC: American Chemical Society; 2008. pp. 217–241.
25. Bennett BD, Kimball EH, Gao M, Osterhout R, Van Dien SJ, Rabinowitz JD. Absolute metabolite concentrations and implied enzyme active site occupancy in Escherichia coli. Nat. Chem. Biol. 2009;5:593–599.
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
27. Karp PD, Paley SM, Krummenacker M, Latendresse M, Dale JM, Lee TJ, Kaipa P, Gilham F, Spaulding A, Popescu L, et al. Pathway tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief. Bioinform. 2010;11:40–79.
28. Schellenberger J, Park JO, Conrad TM, Palsson BØ. BiGG: a biochemical genetic and genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics. 2010;11:213.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press