Fundamentally the HMDB is a multi-purpose bioinformatics–cheminformatics–medical informatics database with a strong focus on quantitative, analytic or molecular-scale information about metabolites, their associated enzymes or transporters and their disease-related properties. In many respects the HMDB combines the data-rich molecular biology content normally found in curated sequence databases such as SwissProt and UniProt (8) with the equally rich data found in KEGG (about metabolism) and OMMBID (about clinical conditions). It also brings in a large body of independently collected experimental data, including NMR spectra, MS spectra, solubility data and validated metabolite concentrations, to complement this literature-derived data.
The diversity of data types, the quantity of experimental data and the required breadth of domain knowledge made the assembly of the HMDB both difficult and time-consuming. To compile, confirm and validate this comprehensive collection of data, more than two dozen textbooks, several thousand journal articles, nearly 30 different electronic databases, and at least 20 in-house or web-based programs were individually searched, accessed, compared, written or run over the course of the previous two years. In addition, more than 2100 confirmatory NMR and MS spectra were collected, 160 experimental solubility determinations were made, 75 organic syntheses were completed and hundreds of high-performance liquid chromatography (HPLC) separations were performed. The team of HMDB contributors and annotators included three organic chemists, six NMR spectroscopists, five mass spectroscopists, two separation specialists, three physicians and 14 bioinformaticians with dual training in computing science and molecular biology/chemistry.
The HMDB currently contains more than 2180 human metabolite entries that are linked to more than 27 700 different synonyms. These metabolites are further connected to some 115 non-redundant pathways, 2080 distinct enzymes, 110 000 SNPs as well as 862 metabolic diseases (genetic and acquired). More than 400 compounds are also linked to experimentally acquired ‘reference’ ¹H and ¹³C NMR and MS/MS spectra. Concentration data (normal and abnormal values) for plasma, urine, CSF and/or other biofluids are also provided for a total of 883 compounds. The entire database, including text, sequence, structure and image data, occupies nearly 18 GB of data, most of which can be freely downloaded.
The HMDB is fully searchable with many built-in tools for viewing, sorting and extracting metabolites, biofluid concentrations, enzymes, genes, NMR or MS spectra and disease information. Detailed instructions on where to locate and how to use these browsing/search tools are provided on the HMDB homepage. As with any web-enabled database, the HMDB supports standard text queries (through the text search box located near the top of each page). It also offers general database browsing using the ‘Browse’ and ‘Biofluids’ buttons located in the HMDB menu bar. To facilitate data browsing, the HMDB is divided into synoptic summary tables which, in turn, are linked to more detailed ‘MetaboCards’, in analogy to the very successful DrugCards concept found in DrugBank (9). All of the HMDB's summary tables can be rapidly browsed, sorted or reformatted in a manner similar to the way PubMed (10) abstracts may be viewed. Clicking on the MetaboCard button found in the leftmost column of any given HMDB summary table opens a webpage describing the compound of interest in much greater detail. Each MetaboCard entry contains more than 90 data fields, with half of the information being devoted to chemical or physico-chemical data and the other half devoted to biological or biomedical data (disease, biofluid concentration, enzyme, gene, SNP or metabolic pathway information). In addition to providing comprehensive numeric, sequence and textual data, each MetaboCard also contains hyperlinks to many other databases (KEGG, BioCyc, PubChem, ChEBI, PubMed, PDB, SwissProt, GenBank, OMIM and dbSNP), abstracts, digital images and applets for viewing molecular structures.
Summary of the data fields or data types found in each MetaboCard
A screenshot montage of the Human Metabolome Database (HMDB) showing several of HMDB's search and data display tools describing the metabolite 1-Methylhistidine. Not all fields are shown.
A key feature that distinguishes the HMDB from other metabolic resources is its extensive support for higher level database searching and selecting functions. In addition to the data viewing and sorting features already described, the HMDB also offers a chemical structure search utility, a local BLAST search (11) that supports both single and multiple sequence queries, a Boolean text search based on GLIMPSE (12), a relational data extraction tool, an MS spectral matching tool and an NMR spectral search tool (for identifying compounds via MS or NMR data from other metabolomic studies).
The HMDB's structure similarity search tool (ChemQuery) is the equivalent of BLAST for chemical structures. Users may sketch a query compound [through Advanced Chemistry Development's (ACD) freely available ChemSketch applet] or paste its SMILES string (13) into the ChemQuery window. Submitting the query launches a structure similarity search that looks for substructures of the query compound that also occur in the HMDB's metabolite database. High scoring hits are presented in a tabular format with hyperlinks to the corresponding MetaboCards (which in turn link to the protein target). The ChemQuery tool allows users to quickly determine whether their compound of interest is a known metabolite or chemically related to a known metabolite. In addition to these structure similarity searches, the ChemQuery utility also supports compound searches on the basis of chemical formula and molecular weight ranges.
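The formula and molecular weight searches mentioned above can be illustrated with a minimal Python sketch. It is not the ChemQuery implementation: the three-compound `LIBRARY`, the function names and the restriction to simple formulas without brackets or charges are all assumptions made for illustration. The atomic masses are standard average values.

```python
import re

# Standard average atomic masses (g/mol) for the elements used in this sketch.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Hypothetical mini-library (name -> molecular formula); not the HMDB itself.
LIBRARY = {
    "1-Methylhistidine": "C7H11N3O2",
    "Homogentisic acid": "C8H8O4",
    "Glycine": "C2H5NO2",
}

def molecular_weight(formula):
    """Average molecular weight of a simple formula such as 'C7H11N3O2'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total

def search_by_formula(formula):
    """Names of library compounds whose formula matches exactly."""
    return sorted(name for name, f in LIBRARY.items() if f == formula)

def search_by_weight(low, high):
    """Names of library compounds whose weight lies in [low, high] g/mol."""
    return sorted(
        name for name, f in LIBRARY.items()
        if low <= molecular_weight(f) <= high
    )
```

Calling `search_by_weight(168, 170)` on this toy library returns both 1-Methylhistidine and homogentisic acid, which shows why a weight window alone is often not enough to identify a compound and why it is usually combined with a formula or structure query.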
The BLAST search (SeqSearch) allows users to search through the HMDB via sequence similarity as opposed to chemical similarity. A given gene or protein sequence may be searched against the HMDB's sequence database of metabolically important enzymes and transporters by pasting the FASTA formatted sequence (or sequences) into the SeqSearch query box and pressing the ‘submit’ button. A significant hit reveals, through the associated MetaboCard hyperlink, the name(s) or chemical structure(s) of metabolites that may act on that query protein. With SeqSearch, metabolite–protein interactions for recently sequenced mammals (chimp, rat, mouse, dog, cat, etc.) may be inferred from the human data in the HMDB.
The HMDB's data extraction utility (Data Extractor) employs a simple relational database system that allows users to select one or more data fields and to search for ranges, occurrences or partial occurrences of words or numbers. The Data Extractor uses clickable web forms so that users may intuitively construct SQL-like queries. The data extraction tool allows users to easily construct complex queries such as ‘find all diseases where the concentration of homogentisic acid in urine is >1 mM’.
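To make the idea of an SQL-like query concrete, the example query quoted above can be sketched against a small in-memory SQLite table. The schema (`concentrations` with `metabolite`, `biofluid`, `disease` and `conc_mM` columns), the function names and all rows are hypothetical; the disease labels and concentration values are placeholders, not measurements from the HMDB, and the HMDB's actual schema is not described here.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with placeholder rows (not real HMDB data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concentrations ("
        "metabolite TEXT, biofluid TEXT, disease TEXT, conc_mM REAL)"
    )
    rows = [
        ("Homogentisic acid", "urine", "Condition A", 5.0),
        ("Homogentisic acid", "urine", "Condition B", 0.2),
        ("Homogentisic acid", "plasma", "Condition A", 3.0),
        ("1-Methylhistidine", "urine", "Condition C", 2.0),
    ]
    conn.executemany("INSERT INTO concentrations VALUES (?, ?, ?, ?)", rows)
    return conn

def diseases_above(conn, metabolite, biofluid, threshold_mM):
    """Diseases in which `metabolite` exceeds `threshold_mM` in `biofluid`."""
    cursor = conn.execute(
        "SELECT DISTINCT disease FROM concentrations "
        "WHERE metabolite = ? AND biofluid = ? AND conc_mM > ? "
        "ORDER BY disease",
        (metabolite, biofluid, threshold_mM),
    )
    return [row[0] for row in cursor.fetchall()]

# 'find all diseases where the concentration of homogentisic acid
#  in urine is >1 mM'
conn = build_demo_db()
hits = diseases_above(conn, "Homogentisic acid", "urine", 1.0)
```

Each web-form selection corresponds to one condition in the `WHERE` clause, so combining a field choice, a biofluid and a numeric range amounts to joining those conditions with `AND`.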
The NMR and MS search utilities allow users to upload spectra (for the MS search) or peak lists (for the NMR search) and to search for matching compounds from the HMDB's collection of MS and NMR spectra. The HMDB contains approximately 3800 predicted ¹H and ¹³C NMR spectra for 1900 compounds. The predicted ¹H and ¹³C NMR spectra were generated using the ACD/HNMR and ACD/CNMR software from Advanced Chemistry Development Inc. Validated Mol files for each compound were used as input for each prediction. In addition, the HMDB contains 930 experimentally collected ¹H and ¹³C NMR spectra for 400 pure compounds (most collected in water at pH 7.0, 10 mM for ¹H, 50 mM for ¹³C). It also contains 1200 MS/MS (Triple-Quad) spectra at three different collision energies for nearly 400 pure compounds. An average of 50 new NMR and MS spectra are being added each month. The HMDB's spectral search utilities allow both pure compounds and mixtures of compounds to be identified from their MS or NMR spectra via peak matching algorithms that were developed in-house. The NMR spectral matching algorithm uses a simple peak matching rule with pre-defined chemical shift tolerances. Query spectra are scored on the number of peak matches to the database spectra. The MS/MS spectral matching algorithm uses a peak matching and spectral scoring concept similar to one previously published by our group (14). The complete set of annotated spectral images (NMR and MS, both experimental and predicted) is retrievable as zip files through the ‘Download’ button located at the top of the HMDB menu.
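The NMR peak matching rule described above (a fixed chemical shift tolerance, with scores based on the number of matched peaks) can be sketched in Python as follows. This is an illustrative approximation rather than the in-house algorithm: the 0.02 ppm tolerance, the greedy one-to-one pairing of peaks, the function names and the placeholder library of compounds and shifts are all assumptions.

```python
def count_matches(query_peaks, reference_peaks, tolerance=0.02):
    """Count query peaks within `tolerance` ppm of a not-yet-used reference peak."""
    unused = sorted(reference_peaks)
    matches = 0
    for q in sorted(query_peaks):
        for i, r in enumerate(unused):
            if abs(q - r) <= tolerance:
                matches += 1
                del unused[i]  # each reference peak may be matched only once
                break
    return matches

def rank_compounds(query_peaks, library, tolerance=0.02):
    """Return (name, score) pairs with at least one match, best score first."""
    scored = [
        (name, count_matches(query_peaks, peaks, tolerance))
        for name, peaks in library.items()
    ]
    hits = [(name, score) for name, score in scored if score > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))

# Placeholder reference peak lists in ppm (not real HMDB spectra).
LIBRARY = {
    "Compound X": [1.20, 3.05, 7.10],
    "Compound Y": [2.50, 3.60],
    "Compound Z": [8.90],
}

# A query peak list from a hypothetical mixture.
ranking = rank_compounds([1.21, 3.04, 3.61, 7.11], LIBRARY)
```

Because every library compound is scored independently against the same query, a mixture spectrum can yield several hits at once. A raw match count favours compounds with many peaks, so a practical scorer might also normalize by the number of reference peaks; that refinement is left out of this sketch.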
The link ‘HML Home’ in the HMDB menu bar refers to the Human Metabolite Library. This is a repository of all purchased, synthesized and isolated metabolites that have been acquired by the HMP team. Small quantities of individual compounds or larger collections of metabolites may be purchased (at cost) or freely acquired for collaborative research (via material transfer agreements) through the HML website and its web ordering forms. These compounds may be used as reference or quantitation standards by metabolomics researchers, or the collections may be used for drug screening, crystal screening and enzyme function assays.
Perhaps the most relevant features of the HMDB from the perspective of a medical geneticist or a clinical chemist are its rich content and extensive linkage to metabolic diseases, to normal and abnormal metabolite concentration ranges (in many different biofluids), to mutation/SNP data and to the genes, enzymes, reactions and pathways associated with many diseases of interest. Currently, the HMDB contains 115 metabolic pathway diagrams or metabolic maps. While this number may seem small, the total number of known human pathways in the KEGG database is just 190, with 72 of these being protein-only pathways (i.e. no metabolites). Nevertheless, this total is expected to increase as there are a growing number of novel gene-metabolite regulation pathways being identified via nutrigenomic research—many of which will be included in the HMDB. There are also a number of important drug and xenobiotic metabolism pathways (not in KEGG or Reactome) that will be added over the coming months.
A particularly recent addition to the HMDB is a series of SimCell (15) metabolic wiring diagrams, SimCell models in SBML (Systems Biology Markup Language) and SimCell simulations of nearly 30 well-characterized metabolic pathways. SimCell is a metabolic simulation software package that allows complex metabolic pathways to be modeled at a cellular level and for ‘real-time’ movies of the enzymatic processes to be generated and graphed. The availability of these pre-assembled metabolic models should allow users to simply download the SimCell wiring diagram and conduct ‘in silico’ gene knock-out experiments or test hypotheses ‘in silico’ concerning the possible causes of a suspected genetic disorder.