3.1. Lipid Databases and Other Small Molecule Databases Containing Lipids
Lipids are generally hydrophobic in nature and soluble in organic solvents. Unlike other biological molecules such as nucleic acids and proteins, however, lipid molecules show remarkable structural and combinatorial diversity. Chemical structures differ substantially across lipid categories and cover a wide range of chemical space. For example, sterol lipids are characterized by a fused four-ring template consisting of three six-membered rings and one five-membered ring; glycerolipids, on the other hand, typically contain no rings and instead carry radyl chains attached to the sn carbons of the glycerol group. The radyl chains may also be unsaturated, with varied double bond positions and geometries, adding to the structural heterogeneity of lipids. Additionally, the large number of possible radyl chains at the various sn carbons of the glycerol group, along with different head groups, leads to combinatorial and positional isomeric diversity of lipid structures in categories such as glycerolipids, glycerophospholipids and sphingolipids. Given this structural diversity and the importance of lipids in the regulation and control of cellular function and disease, it is essential to have a lipid database that not only facilitates the storage, retrieval and dissemination of existing lipid structures and associated physicochemical property data for the lipidomics community but is also extensible, flexible and scalable enough to handle the vast amount of data being generated by new lipidomic studies. A well-designed lipid database must include a defined ontology which incorporates classification, nomenclature, structure representations, definitions, related biological/biophysical properties, cross-references and physicochemical properties (formula, molecular weight, number of carbon atoms, number of various functional groups, etc.) of all objects stored in the database.
This ontology can then be transformed into a well-defined schema that forms the foundation for a relational database of lipids. A large number of repositories (e.g. GenBank,22 SwissProt,23 ENSEMBL24 and GlycomeDB25) exist to support nucleic acid, protein and carbohydrate databases; however, there are only a few specialized databases and resources (e.g. LMSD, LipidBank,9c,d LIPIDAT,9a,b Lipid Library9e and Cyberlipids9f) that are dedicated to cataloging lipids. A variety of other public and commercial small molecule databases (e.g. Human Metabolome Database (HMDB),26 DrugBank,27 Therapeutic Target Database (TTD),28 Chemical Entities of Biological Interest (ChEBI),29 ChemBank,30 PubChem,31 ZINC,32 ChemSpider,33 Chemical Abstracts Service (CAS),34 eMolecules,35 Beilstein36 and Kyoto Encyclopedia of Genes and Genomes (KEGG) LIGAND37) also provide information about lipid structures and their associated physicochemical properties.
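The combinatorial diversity described at the start of this section can be made concrete with a small counting sketch. It assumes that each sn position independently accepts any chain from a fixed pool and that positional isomers count as distinct species; the chain shorthand (carbons:double bonds), the pool sizes and the function name `count_species` are illustrative choices, not values taken from any database discussed here.

```python
from itertools import product


def count_species(n_chains: int, n_positions: int, n_head_groups: int = 1) -> int:
    """Count species when every sn position independently takes one of
    n_chains radyl chains and positional isomers are counted separately."""
    return n_head_groups * n_chains ** n_positions


# Hypothetical pool of radyl chains written as carbons:double bonds.
chains = ["16:0", "18:0", "18:1", "18:2"]

# Explicit enumeration of triradylglycerols: one chain at sn-1, sn-2 and sn-3.
triradyl = list(product(chains, repeat=3))

print(len(triradyl))            # 64, i.e. 4 ** 3
print(count_species(40, 3))     # 64000 triradylglycerols from 40 chains
print(count_species(40, 2, 6))  # 9600 two-chain lipids with 6 head groups
```

The count ignores double bond position and geometry, so the real structural space is larger still.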
While there has been no prior effort at systematic and comprehensive classification and nomenclature of lipid molecules, several small databases, mentioned in the previous paragraph, contain some lipid molecules. The LMSD, being developed by the LIPID MAPS consortium, is one of the latest databases dedicated to lipids and provides comprehensive information about them. In the rest of this section, we provide an overview of the LMSD, other lipid-specific databases and small molecule databases containing lipids (Table 2), followed by a detailed description of the LMSD.
Table 2. Resources and databases containing information about lipids.
The LMSD10 is a relational database containing structures and annotations of biologically relevant lipids. It is being developed and maintained by the LIPID MAPS consortium and currently contains over 30,000 structures obtained from the following sources (): LIPID MAPS Consortium's core laboratories and partners; lipids identified by LIPID MAPS experiments; computationally generated structures for appropriate lipid classes; biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public databases; and peer-reviewed journals and book chapters describing lipid structures.
The LIPID BANK is the lipid database of the Japanese Conference on the Biochemistry of Lipids (JCBL). It contains over 7,000 lipids corresponding to the following main lipid classes: acylglycerol, bile acid, derived lipid, eicosanoid, ether type lipid, fat soluble vitamin, glycolipid, isoprenoid, lipid peroxide, lipoamino acid, lipopolysaccharide, lipoprotein, mycolic acid, phospholipid, steroid, and wax. In addition to classification-based browsing of lipids, the LIPID BANK supports text-based search and retrieval of lipid data using names and physicochemical properties; structure-based search is not available. Along with the structure and other basic information such as molecular weight, molecular formula, name, and common name, the search results provide the following additional information about a lipid: biological activity, physical and chemical properties, spectral data (ultraviolet (UV), infrared (IR), nuclear magnetic resonance (NMR), and mass spectrometry (MS)), chromatogram data, chemical synthesis, metabolism, genetic information, and references.
The LIPIDAT is a relational database of thermodynamic and associated physicochemical property information on lipids. It contains over 20,000 lipids. Users can search the database by various physicochemical properties through more than two dozen text-based query pages. The detailed search results page for a lipid includes the following information: structure, name, and formula along with other basic information; bibliographic information; and experimental results and methods.
The LIPID LIBRARY is not a database of lipids but an online resource about the chemistry, biology, technology, and analysis of lipids. Its pages are organized into the following sections: basic information, biochemistry and nutrition, lipid analysis, oils and fats, and latest news. The basic information section covers structures, definitions, composition, biochemistry, and functions of these lipid categories: fatty acids and eicosanoids, simple and complex glycerolipids and phospholipids, sphingolipids, and sterols. The biochemistry and nutrition section covers only plant lipid biochemistry. The lipid analysis section describes both chromatographic and spectroscopic techniques used for the analysis of lipids, along with literature surveys of analytical methodologies. The oils and fats section covers the chemistry and technology of oils and fats along with the history of the science and technology. For each lipid covered in the basic information section, the following details are provided: structure, name, source and occurrence, and biochemistry and function, along with appropriate literature references.
The Cyberlipids is an online resource for studies of lipids. It provides information about definitions, source, compositions, and physicochemical properties of lipids along with detailed review of various lipid analysis techniques. The users can retrieve detailed information about a lipid using its name for more than 900 lipids or get a list of all lipids with links to detailed information.
The Human Metabolome Database (HMDB) is a database containing information about small molecule metabolites, including lipids, found in the human body. It contains over 7,900 metabolite entries with links to over 7,200 protein and deoxyribonucleic acid (DNA) sequences. The database provides links to three kinds of data: chemical data, clinical data, and molecular biology/biochemistry data. Users can search HMDB using text, chemical structure, and arbitrary relationships of available data fields. Searching by spectral and chromatography data (MS, MS-MS, GC-MS, and NMR) is also available. Additionally, a variety of data browsing options are provided: class-based browsing, pathway, disease, and so on. The detailed information about each molecule is presented as a MetaboCard containing over 110 data fields, with two-thirds of the fields containing chemical/clinical data and the rest enzymatic and biological data. Links to other external data sources are also provided.
The DrugBank database provides detailed information about drugs, including lipids, along with their drug targets. The detailed drug information consists of chemical, pharmacological, and pharmaceutical information; the target information corresponds to sequence, structure, and pathway. The database contains over 6,800 drug entries covering the following types of drugs: over 1,400 Food and Drug Administration (FDA)-approved small molecule drugs, over 130 FDA-approved biologic drugs, over 83 nutraceuticals, and over 5,000 experimental drugs. Additionally, information for over 4,000 non-redundant protein target sequences is linked to drug entries. Users can search the DrugBank database using text, chemical structure, and arbitrary combinations of available data fields. A variety of data browsing options are also available: drug name, pathway, class name, and so on. The detailed information about each drug is presented as a DrugCard containing over 150 data fields, with half of the information covering drug/chemical data and the rest corresponding to the drug target.
The Therapeutic Target Database (TTD) provides information about known targets along with information on the associated diseases, pathways, and drugs for these targets. The TTD contains information for over 1,900 targets and over 5,000 drugs, including over 3,000 small molecule drugs. The drug information covers over 1,500 approved drugs, over 1,100 drugs in clinical trials, and over 2,300 experimental drugs. The text-based database search supports searching by target/disease name, drug name, function and classification. The detailed search results page contains information about the target and disease, the drug name and its function, and links to other external databases containing information about targets and drugs.
The Chemical Entities of Biological Interest (ChEBI) database provides structural and ontological information about molecular entities, focused on small molecule compounds including lipids. The molecular entities are either natural products or synthetic products used for biological intervention; nucleic acids are not included. The ChEBI database contains over 19,000 small molecules. The information about small molecules in ChEBI comes from four key sources: IntEnz,38 the integrated relational enzyme database of the European Bioinformatics Institute (EBI); KEGG COMPOUND;39 PDBeChem;40 and ChEMBL.41 Users can search the ChEBI database using text, chemical structure, and arbitrary combinations of available data fields. The structure-based search also supports similarity and substructure searching. Along with the structure and other basic information such as molecular weight, molecular formula, name, and common name, the detailed search results provide the following additional information about a small molecule: ChEBI ontology, brand name, references to other databases, registry numbers corresponding to external sources (CAS, Beilstein, and Gmelin), and literature references.
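The similarity and substructure searching mentioned above can be illustrated with a toy fingerprint sketch. The text does not state which similarity measure any of these databases uses; the Tanimoto coefficient below is simply a common choice in cheminformatics, and the fingerprints, molecule names and function names are all hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def passes_screen(query: set, target: set) -> bool:
    """Fingerprint pre-screen for substructure search: every bit set in the
    query must also be set in the target. Passing is necessary, not sufficient;
    a real system would follow up with atom-by-atom matching."""
    return query <= target


# Hypothetical fingerprints: each integer marks a structural feature that is present.
library = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 2, 3},
    "mol_c": {1, 4, 7},
}
query = {1, 4, 7}

# Similarity search: rank library entries by decreasing Tanimoto score.
ranked = sorted(
    ((name, tanimoto(query, bits)) for name, bits in library.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # [('mol_c', 1.0), ('mol_a', 0.75), ('mol_b', 0.2)]

# Substructure pre-screen: keep only entries whose bits cover the query bits.
print(sorted(n for n, bits in library.items() if passes_screen(query, bits)))
```

An exact-match search corresponds to a score of 1.0 combined with identical fingerprints, whereas similarity search returns everything above a chosen threshold.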
The ChemBank is a relational database containing data derived from small molecules, including lipids, and small molecule screens along with tools for analyzing these data. The database contents include chemical structures and names, calculated molecular descriptors, human curated information about small molecules activities, raw experimental results from high-throughput biological assays, and metadata describing the screening experiments. The ChemBank database contains data for over 1.7 million compound samples with over 1.2 million unique small molecule structures screened against more than 2,500 assays covering more than 180 projects. Additionally, it contains information for over 1,000 proteins, 500 cell lines and 70 species associated with various assays. The users can search ChemBank using text, chemical structure, and arbitrary relationships of available data fields. The structure-based searching, in addition to substructure and exact match, also supports similarity searching. The database searching using information about high-throughput screens and small molecule assays is also available. Additionally, a number of tools for analysis and visualization of small molecule screening results are provided. The detailed search results along with structure and other basic information such as molecular weight, molecular formula, name, and common name provide the following additional information for a small molecule: a large number of calculated physicochemical properties; compound sample information; screening information including project name, assay name, assay type, plate, well and z-score.
The PubChem database is a database of chemical molecules and the biological activities of molecules screened against various assays. It also contains information about lipids, as the LIPID MAPS consortium uploads its LMSD database of lipids into PubChem on a regular basis. The PubChem database is divided into three main categories: the Compound database, with over 32 million entries, contains unique chemical substances derived from substance depositions; the Substance database, with over 74 million entries, consists of chemical compounds submitted by depositors corresponding to mixtures, extracts, and complexes; and the BioAssay database contains biological activity results from over 1,600 high-throughput screening projects with several million measured values. PubChem data deposition is open to the scientific community. The growing list of over 140 substance and 47 assay depositors represents all major sources including commercial vendors, public non-profit organizations, pharmaceutical companies, and individual contributors. Users can search the PubChem compound, substance, and bioassay databases using text, chemical structure, and arbitrary relationships of available data fields. The text-based searching supports a wide variety of parameters including name, formula, physicochemical properties, stereochemistry specifications, elements, and so on. The structure-based searching supports substructure/superstructure search and identity/similarity search. Along with the structure and other basic information such as molecular weight, molecular formula, name, and common name, the detailed search results page for a compound provides the following additional information: synonyms, calculated physicochemical properties, substance information, biomedical annotation, pharmacological action and classification, chemical classification, safety and toxicology, links to existing literature, and so on.
The detailed results page for a substance, in addition to basic information such as chemical structure, name, and formula, contains the following additional information: a link to the data depositor, links to any bioactivity information and other structurally related substances, and links to other databases maintained by the National Center for Biotechnology Information (NCBI). A variety of analysis tools, such as bioactivity structure-activity analysis and chemical structure clustering, are also provided for the analysis of bioassay screening data.
The ZINC database contains commercially available small molecules for virtual screening. It contains over 13 million purchasable compounds including lipids. The users can search the ZINC database using compound name, chemical structure/substructure, physicochemical properties, vendor catalog number/source, and so on. The compound detailed search page includes chemical structure, name, formula, various calculated physicochemical properties, vendor and purchase information, and availability.
The ChemSpider is a chemical database and an online resource linking together compound information across the web. The compound information includes physical and chemical properties, chemical structure, systematic nomenclature, spectral data, synthetic methods, known reactions, and safety information. The ChemSpider contains over 25 million unique chemical compounds sourced from and linked to over 400 separate data sources, including LIPID MAPS for lipids. The compound data is collected from over 50 different sources. Additionally, the ChemSpider supports the uploading and curation of chemical structure and spectral data by the scientific community. Users can search the ChemSpider database using text, chemical structure, and arbitrary relationships of available data fields. The text-based searching supports a wide variety of parameters including name, formula, physicochemical properties, literature search, and so on. The structure-based search supports chemical structure/substructure search along with arbitrary combinations of calculated physicochemical properties. Along with the structure and other basic information such as molecular weight, molecular formula, name, and common name, the detailed search results page for a compound provides the following additional information: links to Wikipedia articles; associated data sources and commercial suppliers; patents; literature articles; calculated physicochemical properties; medical subject headings classification; pharmacological data; spectra; and links to other literature data. The ChemSpider online resource also hosts a variety of web services such as conversion of chemical names to structures, generation of InChI strings, and calculation of various physicochemical properties.
The Chemical Abstracts Service (CAS) is a comprehensive resource of chemical information combining databases with search and analysis tools, available as chemical abstracts and chemical databases. The CAS provides two main chemical databases: CAplus and CAS REGISTRY. The CAplus database consists of summaries and indices of scientific literature covering chemistry and chemistry-related topics such as proteomics, genomics, and so on. The CAplus database contains over 33 million references; its coverage of scientific literature starts from the early 1800s and spans 10,000 journals, technical reports, conference proceedings, and books in more than 60 languages, and it also covers patent literature from over 60 countries. The CAS REGISTRY database contains over 52 million organic and inorganic chemical substances and over 62 million sequences. Its coverage of chemical substances also starts from the early 1800s and includes substances from patents, chemical catalogs, and various web sources; the sequence data is retrieved from GenBank. In addition to basic compound information such as structure, name, formula, and molecular weight, the chemical substance record contains the following additional information: a unique CAS number, experimental and calculated physicochemical properties, ring analysis, and literature references. The CAS databases are searched using SciFinder, which supports both text-based and structure-based searching along with the use of other parameters during the search. In addition to CAplus and CAS REGISTRY, the CAS provides the following three databases: CASREACT, CHEMLIST, and CHEMCATS. The CASREACT and CHEMLIST databases contain information about chemical synthesis and regulated chemicals, respectively. The CHEMCATS database contains over 44 million commercially available substances covering over 1,200 catalogs from 1,100 suppliers; it has over 12 million chemical substances with unique CAS numbers.
The eMolecules is an online resource for commercially available chemical molecules including lipids. It contains over 8 million unique molecules from a variety of commercial catalogs and other online data sources such as the National Institute of Standards and Technology (NIST), PubChem, DrugBank, and LIPID MAPS. Users can search the eMolecules database using molecule name, molecule structure/substructure, suppliers, and various physicochemical properties. In addition to basic molecule information such as structure, name, formula, and molecular weight, the molecule record contains information about suppliers and links for ordering chemicals.
The Beilstein database provides experimentally validated information about millions of chemical compounds, uniquely identified by Beilstein Registry Numbers, and chemical reactions compiled from scientific literature starting from 1771. The original database was created using Beilstein's Handbook of Organic Chemistry and contains information about reactions, chemical substances, chemical structures, and physicochemical properties. The record for each substance has over 350 data fields corresponding to chemical and physical data along with appropriate literature references. Users can search the database through the Reaxys system using one of the following three search options: reaction searching, substance and property searching, and text searching. During reaction searching, a variety of other parameters such as starting materials, product, reaction conditions, and so on can also be specified. The substance and property searching provides structure/substructure search along with specification of various physical and chemical properties. The text-based search allows users to retrieve appropriate data using substance name, authors, and a variety of other parameters. Along with the structure and other basic information such as molecular weight, molecular formula, name, and common name, the detailed search results page for a substance provides the following additional information: calculated physicochemical properties, physical and spectral data, synthesis information, and links to literature.
The KEGG LIGAND is a database of chemical compounds and reactions involved in biological pathways. It is a composite database consisting of three other databases: KEGG COMPOUND, KEGG ENZYME, and KEGG REACTION. The KEGG COMPOUND database contains information for over 7,000 metabolites and biologically relevant chemical compounds including lipids which are classified according to LIPID MAPS classification system and made available through KEGG BRITE database. The KEGG REACTION database contains information for over 5,000 reactions corresponding to metabolic and other reactions. The KEGG ENZYME database has information for over 3,800 enzymes involved in various transformations. The users can search KEGG LIGAND databases using text and chemical structures. The structure-based search supports structure/substructure search along with similarity searching. The detailed search results page for a compound along with structure and other basic information such as molecular weight, molecular formula, name, and common name provide the following additional information: links to ENZYME and REACTION databases, links to external data sources such as PubChem and CAS numbers.
3.1.1. Populating the Structure Database
An object-relational database of lipids containing structural, biophysical and biochemical characteristics is available on the Lipidomics Gateway website with browsing and searching capabilities. The LMSD currently contains over 30,000 structures obtained from a variety of sources: LIPID MAPS Consortium's core laboratories and partners; lipids identified by LIPID MAPS experiments; computationally generated structures for appropriate lipid classes; biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public databases; and peer-reviewed journals and book chapters describing lipid structures (). All structures have been classified and redrawn according to LIPID MAPS guidelines. After lipids have been selected for inclusion in LMSD, they are classified following the LIPID MAPS classification scheme, as explained earlier under the classification, ontology and nomenclature of lipid molecules section. Structures of the lipids are either drawn manually or generated automatically by computational structure drawing tools developed by the LIPID MAPS consortium; the structure representation is consistent and adheres to the rules proposed by the LIPID MAPS consortium. Based on its classification, each lipid structure in LMSD is assigned a unique LM ID. The format of the LM ID () not only maintains the uniqueness of the ID but also provides the capability to add new categories, classes, and subclasses as the need arises.
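Because the LM ID encodes the classification, it can be decomposed mechanically. The sketch below assumes the commonly published 12-character layout (the fixed prefix "LM", a two-letter category code, a two-digit main class, a two-digit subclass, and a four-character identifier unique within the subclass); the format table itself is not reproduced in this section, so the pattern should be checked against it. The names `parse_lm_id` and `LMID` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import NamedTuple

# Assumed layout: "LM" + category (2 letters) + class (2 digits)
# + subclass (2 digits) + unique identifier (4 characters).
LM_ID_PATTERN = re.compile(r"LM([A-Z]{2})(\d{2})(\d{2})([0-9A-Z]{4})")

CATEGORY_NAMES = {
    "FA": "Fatty Acyls",
    "GL": "Glycerolipids",
    "GP": "Glycerophospholipids",
    "SP": "Sphingolipids",
    "ST": "Sterol Lipids",
    "PR": "Prenol Lipids",
    "SL": "Saccharolipids",
    "PK": "Polyketides",
}


class LMID(NamedTuple):
    category: str    # e.g. "FA"
    main_class: str  # category code + class digits, e.g. "FA01"
    subclass: str    # main class + subclass digits, e.g. "FA0101"
    serial: str      # identifier unique within the subclass


def parse_lm_id(lm_id: str) -> LMID:
    """Split an LM ID into its classification levels under the assumed layout."""
    match = LM_ID_PATTERN.fullmatch(lm_id)
    if match is None:
        raise ValueError(f"not a 12-character LM ID: {lm_id!r}")
    category, cls, sub, serial = match.groups()
    return LMID(category, category + cls, category + cls + sub, serial)


parsed = parse_lm_id("LMFA01010001")
print(parsed)
print(CATEGORY_NAMES[parsed.category])  # Fatty Acyls
```

Since every level is a fixed-width field, a new class or subclass only needs an unused code, which is one way such a format can remain extensible.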
In addition to importing and manually curating biologically relevant lipids from other database sources, LMSD also stores their original IDs to enable cross-referencing. LMSD lipid structures are deposited into the PubChem database periodically, and a link to the PubChem Substance ID (SID) is maintained within LMSD. Access to the complete set of LMSD lipid structures in the PubChem database is also available.42
LMSD structures are either drawn manually using ChemDraw or generated automatically by structure drawing tools developed by the LIPID MAPS consortium for various subclasses in fatty acyls, glycerolipids, glycerophospholipids, sphingolipids, and sterols. The structure drawing tools are Perl scripts which can generate a large number of structures relatively quickly via a command-line or web-based interface. In addition to consistent structure representations from lipid abbreviations, these scripts also generate ontological information such as the number of double bonds, chain lengths at different positions on the glycerol backbone, the number of various functional groups, and other structural characteristics. The ontological information is also loaded into LMSD. The InChI strings and InChIKeys for lipid structures are also generated using the command-line executable available from the InChI website and loaded into Oracle database43 tables. The database schema used for LMSD is outlined in an entity relationship diagram in .
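The idea of deriving ontological information from a lipid abbreviation can be sketched in a few lines. This is not the consortium's Perl code; it is a minimal Python illustration that assumes a simplified abbreviation grammar of the form HEAD(chain/chain/...), where each chain is written as carbons:double bonds with optional double bond positions in parentheses. Ether, plasmalogen and sphingoid base prefixes are deliberately not handled, and the function names are hypothetical.

```python
import re

# One radyl chain such as "16:0" or "18:2(9Z,12Z)".
CHAIN_PATTERN = re.compile(r"(\d+):(\d+)(?:\(([^()]*)\))?")


def parse_chain(text: str) -> dict:
    """Extract chain length, double bond count and positions from one chain."""
    match = CHAIN_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported chain descriptor: {text!r}")
    positions = match.group(3).split(",") if match.group(3) else []
    return {
        "carbons": int(match.group(1)),
        "double_bonds": int(match.group(2)),
        "positions": positions,
    }


def parse_abbreviation(abbrev: str) -> dict:
    """Derive simple ontology fields from an abbreviation like PC(16:0/18:1(9Z))."""
    head, _, rest = abbrev.partition("(")
    if not head or not rest.endswith(")"):
        raise ValueError(f"unsupported abbreviation: {abbrev!r}")
    # Chains are separated by "/" in sn order; positions inside a chain use commas.
    chains = [parse_chain(part) for part in rest[:-1].split("/")]
    return {
        "head_group": head,
        "chains": chains,
        "total_carbons": sum(c["carbons"] for c in chains),
        "total_double_bonds": sum(c["double_bonds"] for c in chains),
    }


info = parse_abbreviation("TG(16:0/18:1(9Z)/18:2(9Z,12Z))")
print(info["total_carbons"], info["total_double_bonds"])  # 52 3
```

Fields such as `total_carbons` and `total_double_bonds` are the kind of values that could be stored in ontology columns alongside each structure.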
3.1.2. Searching the Structure Database
The Lipidomics Gateway website supports searching the LMSD in three different ways: classification-based, text/ontology-based, and structure-based search. The classification-based browsing provides the capability to retrieve lipids based on the LIPID MAPS classification scheme. After the user selects one of the main categories of lipids, a listing of all lipids present in the selected category, along with a link to the set of lipids in each main class and subclass, is provided. The user may then select all lipids which belong to either a main class or a subclass and display the results as a result summary page.
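The classification-based browsing described above amounts to walking a three-level hierarchy. The sketch below builds such a hierarchy in memory and retrieves lipids at the category, main class or subclass level. The handful of records is illustrative only, and `build_index` and `browse` are hypothetical helpers, not part of the Lipidomics Gateway.

```python
from collections import defaultdict

# Illustrative records: (name, category, main class, subclass).
RECORDS = [
    ("palmitic acid", "Fatty Acyls", "Fatty Acids and Conjugates", "Straight chain fatty acids"),
    ("oleic acid", "Fatty Acyls", "Fatty Acids and Conjugates", "Unsaturated fatty acids"),
    ("prostaglandin E2", "Fatty Acyls", "Eicosanoids", "Prostaglandins"),
    ("cholesterol", "Sterol Lipids", "Sterols", "Cholesterol and derivatives"),
]


def build_index(records):
    """Nest records as category -> main class -> subclass -> [names]."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for name, category, main_class, subclass in records:
        index[category][main_class][subclass].append(name)
    return index


def browse(index, category, main_class=None, subclass=None):
    """Return sorted lipid names at the requested level of the hierarchy."""
    names = []
    for mc, subclasses in index.get(category, {}).items():
        if main_class is not None and mc != main_class:
            continue
        for sc, members in subclasses.items():
            if subclass is not None and sc != subclass:
                continue
            names.extend(members)
    return sorted(names)


index = build_index(RECORDS)
print(browse(index, "Fatty Acyls"))
print(browse(index, "Fatty Acyls", "Fatty Acids and Conjugates"))
print(browse(index, "Fatty Acyls", "Eicosanoids", "Prostaglandins"))
```

In a relational database the same drill-down would typically be a query filtered on category, main class and subclass columns rather than a nested dictionary.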
For lipids containing multiple functional groups, assignment of a structure to a particular subclass may be somewhat subjective. For example, a fatty acid containing both epoxy and hydroxy groups could be assigned to either the epoxy or the hydroxy fatty acids subclass. To address this situation, an ontology-based search is also provided. The user may choose to search for lipids containing a given functionality, and all the lipids with that functionality, irrespective of their subclass designation, are retrieved. The text/ontology-based query page allows the user to search LMSD by any combination of these data fields: LM ID, common or systematic name, mass along with a tolerance value, formula, category, main class, subclass, and various combinations of ontology parameters. The structure-based search page provides the capability to search LMSD by performing a substructure or exact match using the structure drawn by the user. The three supported structure drawing tools are MarvinSketch,16 JME,44 and ChemDrawPro.18 The first two of these structure drawing tools are Java applets and require only applet support in the browser. In addition to the structure, the user can also specify the LM ID and common or systematic name for the search.
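The mass-with-tolerance and ontology-based queries described above can be sketched together. The example computes monoisotopic masses from molecular formulas and filters an in-memory list by mass window and by minimum functional group counts, so that a lipid is found by its functionality regardless of the subclass it was filed under. The records, their subclass assignments and the function names are illustrative assumptions, not LMSD data or its query interface.

```python
import re

# Monoisotopic masses of the most abundant isotopes (in Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "P": 30.9737620}


def exact_mass(formula: str) -> float:
    """Monoisotopic mass of a simple formula such as 'C16H32O2'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


# Illustrative records; subclass labels and group counts are placeholders.
RECORDS = [
    {"name": "palmitic acid", "formula": "C16H32O2",
     "subclass": "Straight chain fatty acids", "groups": {}},
    {"name": "ricinoleic acid", "formula": "C18H34O3",
     "subclass": "Hydroxy fatty acids", "groups": {"hydroxy": 1}},
    {"name": "epoxy-hydroxy C18 fatty acid (placeholder)", "formula": "C18H32O4",
     "subclass": "Epoxy fatty acids", "groups": {"epoxy": 1, "hydroxy": 1}},
]


def search(records, mass=None, tolerance=0.01, min_groups=None):
    """Return names matching a mass window and/or minimum functional group counts."""
    hits = []
    for rec in records:
        if mass is not None and abs(exact_mass(rec["formula"]) - mass) > tolerance:
            continue
        if min_groups and any(rec["groups"].get(group, 0) < needed
                              for group, needed in min_groups.items()):
            continue
        hits.append(rec["name"])
    return hits


print(search(RECORDS, mass=256.24, tolerance=0.01))
# The hydroxy query returns both hydroxy-containing lipids, whatever their subclass.
print(search(RECORDS, min_groups={"hydroxy": 1}))
```

Filtering on functional group counts rather than on subclass is what sidesteps the subjective subclass assignment for multifunctional lipids.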
The record details page, in addition to displaying the structure of the selected lipid, contains all relevant information for that molecule, such as common and systematic names, synonyms, molecular formula, exact mass, classification hierarchy, InChIKey, and cross-references (if any) to other databases.
The default lipid detail page uses a Graphics Interchange Format (GIF) image to represent the structure of the lipid. The GIF format was chosen for representing lipid structures in the web browser because it is natively supported across all browsers. The structure may also be viewed using MarvinView,16 JMol,17 and the ChemDraw ActiveX/Plugin18 formats, where structures may be manipulated, scaled and saved in a number of high-resolution formats. shows screen shots of the LMSD user interface for lipid classification-based browsing, text-based and structure-based searching.