GlycomeDB integrates the structural and taxonomic data of all major public carbohydrate databases, as well as carbohydrates contained in the Protein Data Bank, which renders the database currently the most comprehensive and unified resource for carbohydrate structures worldwide. GlycomeDB retains the links to the original databases and is updated at weekly intervals with the newest structures available from the source databases. The complete database can be downloaded freely or accessed through a Web-interface (www.glycome-db.org) that provides flexible and powerful search functionalities.
In a recent NIH whitepaper (1) the lack of a comprehensive, curated carbohydrate structure database was identified as the largest deficit in glycomics and glycobiology research. The Complex Carbohydrate Structure Database (CCSD) (2), initiated in the 1980s, was the largest effort to date to collect carbohydrate structures, mainly through retrospective manual extraction from the literature. The database contained about 50000 entries when it ceased to be updated in the late 1990s due to a lack of funding. Since then different specialized databases have been developed, which were initially seeded with a subset of the structures contained in the CCSD (3). Subsequently these databases were further extended with carbohydrate structures reflecting the research focus of the group that maintained the database. As a result, different valuable collections of carbohydrate data have emerged over recent years, for example: the Bacterial Carbohydrate Structure Database (BCSDB) (4) that collects all published bacterial carbohydrate structures (including their NMR spectra); the database of the Consortium for Functional Glycomics (CFG) that provides access to primary experimental data like that from glycan microarray screens (5); and the Kyoto Encyclopedia of Genes and Genomes (KEGG) that contains glycan-related biosynthetic pathways (6). Unfortunately each of these databases uses a different ‘sequence format’ for encoding carbohydrate structures, making it difficult to query across all public databases and analyze or compare their content, or simply to find out whether some additional information on a particular carbohydrate structure is available in any of the databases.
In 2005, a new initiative was launched to overcome the isolation of the public carbohydrate structure databases and to create a comprehensive index of all available structures with cross-links back to the original databases. To achieve this goal, structures from the freely available databases were translated to the GlycoCT sequence format (7), where possible, and stored in a new database, the GlycomeDB (8). The integration process is performed incrementally on a weekly basis, updating the GlycomeDB with the newest structures available in the associated databases. A Java software application called GlycoUpdateDB, which is complemented by a PostgreSQL database, downloads the data from the public databases, reads their sequence notations and translates them to the GlycoCT encoding format. In addition, the taxonomic annotations are standardized semi-automatically based on curated tables that map the (free-text) annotations used in the source databases to NCBI taxonomy IDs [for more details see (8)]. To extract the carbohydrate structures from the Protein Data Bank (PDB) the pdb2linucs tool is used (9). During the integration process automated checks are performed; structures that contain errors are reported to the administrators of the original database.
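The semi-automatic taxonomy standardization described above can be illustrated with a minimal sketch. It is not the GlycoUpdateDB implementation: the table layout, the `normalize` and `map_taxa` helpers and the sample annotations are assumptions made for illustration; only the three NCBI taxonomy IDs are real.

```python
# Hypothetical curated table: free-text source annotation -> NCBI taxonomy ID.
# The IDs (9606, 10090, 562) are real; the table itself is an assumption.
CURATED_TAXA = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "escherichia coli": 562,
    "e. coli": 562,
}


def normalize(annotation: str) -> str:
    """Lower-case and collapse whitespace so spelling variants share one key."""
    return " ".join(annotation.lower().split())


def map_taxa(annotations):
    """Split annotations into automatically mapped ones and ones needing review."""
    mapped, needs_review = {}, []
    for text in annotations:
        tax_id = CURATED_TAXA.get(normalize(text))
        if tax_id is None:
            # The "semi" in semi-automatic: a curator decides on these by hand.
            needs_review.append(text)
        else:
            mapped[text] = tax_id
    return mapped, needs_review


mapped, needs_review = map_taxa(["Human", "  E.  coli ", "unknown bacterium"])
```

In this sketch, annotations that miss the curated table are collected rather than guessed at, which mirrors the manual curation step mentioned in the text.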
A major challenge during the initial integration process was the lack of a controlled vocabulary for carbohydrate and non-carbohydrate residue names. Even within a single database the same monosaccharide could have different names. In total 12253 different residue names were extracted from the sequences stored in the original carbohydrate databases, 5854 of which were identified as non-carbohydrate residues, mainly aglycons, such as amino acids, lipids or other small organic molecules attached to the reducing end of the carbohydrate. A further 5330 residue names could be identified as monosaccharides and were assigned a standardized GlycoCT encoding. The remaining 1069 residue names could not be interpreted so far. Based on the initial analysis of the namespace used to encode carbohydrate structures in the various databases, a dictionary has been created that contains mappings of the various encoding formats. The dictionary is now used to support the automated update process. If a new residue name appears, it is reported to the database curator, who can then check whether the residue name is valid and add the new residue to the dictionary. Finally, a web interface has been developed (www.glycome-db.org) as a single query point for all open access carbohydrate structure databases (10).
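A dictionary-driven residue check of the kind described above might look roughly as follows. This is a sketch under assumptions: the dictionary entries, the standardized names (stand-ins, not real GlycoCT strings) and the `classify` helper are hypothetical. The last line only re-checks the arithmetic of the counts quoted in the text.

```python
from collections import Counter

# Hypothetical dictionary: source residue name -> (class, standardized name).
# The standardized names are placeholders, not actual GlycoCT encodings.
RESIDUE_DICTIONARY = {
    "Glc": ("monosaccharide", "std-glucose"),
    "D-Glcp": ("monosaccharide", "std-glucose"),
    "Gal": ("monosaccharide", "std-galactose"),
    "Ser": ("aglycon", "serine"),
    "Cer": ("aglycon", "ceramide"),
}


def classify(names):
    """Count residue names per class and collect names missing from the dictionary."""
    counts = Counter()
    unknown = []
    for name in names:
        entry = RESIDUE_DICTIONARY.get(name)
        if entry is None:
            counts["unknown"] += 1
            unknown.append(name)  # would be reported to the curator
        else:
            counts[entry[0]] += 1
    return counts, unknown


counts, to_curator = classify(["Glc", "D-Glcp", "Ser", "Xyz?"])

# Consistency check on the numbers reported in the text:
# total names minus non-carbohydrate names minus monosaccharide names.
remaining = 12253 - 5854 - 5330
```

Here two different source names resolve to the same standardized residue, which is the situation the controlled vocabulary is meant to handle.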
GlycomeDB contains the unified carbohydrate sequences of all publicly accessible databases that contain carbohydrate structures. In total 121766 original sequences were parsed and integrated. Currently (August 2010) there are 35873 unique carbohydrate sequences (with taxonomic annotations if available) stored in GlycomeDB, 11822 of which are fully determined carbohydrates. A carbohydrate structure is defined as ‘fully determined’ if all monosaccharide characteristics (base type, anomer, ring size, substituents, modifications, etc.) and all linkage positions are known. For polysaccharides the number of repeating units needs to be determined as well. An overview of the number of carbohydrate structures contributed by each database is given in Table 1.
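The ‘fully determined’ criterion can be expressed as a simple predicate. The sketch below uses a deliberately reduced, hypothetical data model (the class and field names are assumptions, and substituents and modifications are omitted for brevity); it is not the GlycomeDB data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Monosaccharide:
    base_type: Optional[str]  # None means "unknown"
    anomer: Optional[str]
    ring_size: Optional[str]


@dataclass
class Linkage:
    parent_position: Optional[int]
    child_position: Optional[int]


@dataclass
class Glycan:
    residues: List[Monosaccharide]
    linkages: List[Linkage]
    is_repeating: bool = False
    repeat_count: Optional[int] = None


def is_fully_determined(glycan: Glycan) -> bool:
    """True if no residue property, linkage position or repeat count is unknown."""
    residues_known = all(
        None not in (r.base_type, r.anomer, r.ring_size) for r in glycan.residues
    )
    linkages_known = all(
        l.parent_position is not None and l.child_position is not None
        for l in glycan.linkages
    )
    # Only repeating structures (polysaccharides) need a known repeat count.
    repeats_known = (not glycan.is_repeating) or glycan.repeat_count is not None
    return residues_known and linkages_known and repeats_known


glc = Monosaccharide("glc", "b", "p")
gal_unknown_anomer = Monosaccharide("gal", None, "p")
complete = Glycan([glc, glc], [Linkage(4, 1)])
partial = Glycan([glc, gal_unknown_anomer], [Linkage(4, 1)])
polymer = Glycan([glc], [Linkage(4, 1)], is_repeating=True)
```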
Four major structural query options are implemented in GlycomeDB, namely ‘exact structure search’, ‘substructure search’, ‘similarity search’ and ‘maximum common substructure search’ (10). Structural queries can be entered graphically, either using GlycanBuilder (14) as the default, or using DrawRINGS, developed by a Japanese group at SOKA University, Tokyo (http://rings.t.soka.ac.jp). It is also possible to specify the query structure by using different machine-readable encoding formats, among which are CarbBank format (2), LINUCS (15), LinearCode® (16), BCSDB encoding (4) and Glyde II (http://glycomics.ccrc.uga.edu/core4/informatics-glyde-ii.html).
In addition to the exact structure search, which is based on a comparison of ordered GlycoCT encodings (7), it is possible to generate queries with partially unknown information at the monosaccharide level, i.e. unknown anomeric center, ring size, or absolute configuration. It is also possible to restrict the search to specific taxonomic sources, as GlycomeDB consistently applies the NCBI taxonomy to the taxonomic data (17). The various search options can be combined sequentially into a multistep query refinement workflow, which allows very complex queries to be performed.
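The idea of combining queries with unknown monosaccharide properties and a taxonomic restriction into a stepwise refinement can be sketched as below. This is an assumption-laden illustration, not the GlycomeDB search engine: residues are reduced to (anomer, base type, ring size) tuples, a "contains a matching residue" test stands in for real substructure search, and all names and sample entries are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A residue is modeled as (anomer, base type, ring size); None means "unknown".
Residue = Tuple[Optional[str], Optional[str], Optional[str]]


def residue_matches(pattern: Residue, residue: Residue) -> bool:
    """A None field in the query pattern acts as a wildcard."""
    return all(p is None or p == r for p, r in zip(pattern, residue))


def has_residue(pattern: Residue) -> Callable[[dict], bool]:
    """Filter step: keep entries containing a residue that matches the pattern."""
    return lambda entry: any(residue_matches(pattern, r) for r in entry["residues"])


def from_taxon(tax_id: int) -> Callable[[dict], bool]:
    """Filter step: keep entries annotated with the given NCBI taxonomy ID."""
    return lambda entry: tax_id in entry["taxa"]


def refine(entries: Iterable[dict], steps: List[Callable[[dict], bool]]) -> List[dict]:
    """Apply each filter to the result of the previous one."""
    result = list(entries)
    for step in steps:
        result = [e for e in result if step(e)]
    return result


# Hypothetical sample entries; 9606 and 562 are real NCBI taxonomy IDs.
ENTRIES = [
    {"id": 1, "taxa": {9606}, "residues": [("b", "glc", "p"), ("a", "man", "p")]},
    {"id": 2, "taxa": {562}, "residues": [("a", "glc", "p")]},
    {"id": 3, "taxa": {9606}, "residues": [("b", "gal", "p")]},
]

# Step 1: glucopyranose with unknown anomer; step 2: restrict to taxon 9606.
hits = refine(ENTRIES, [has_residue((None, "glc", "p")), from_taxon(9606)])
hit_ids = [e["id"] for e in hits]
```

Each step narrows the previous result set, so the order of steps affects intermediate results but not the final intersection.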
Using the GlycomeDB information page for individual structures (Figure 1), the user can use hyperlinks to navigate to the relevant pages of the external databases, which offer additional information such as literature references, experimental data or 3D structures. Additionally, information about bound aglycons and structural motifs, and a selectable sequence encoding are displayed. For more detailed information about the various aglycons attached to a particular carbohydrate, the user is guided to the original databases by following the link ‘Show remote structure evidences’.
GlycomeDB integrates the structural and taxonomic data of all major public carbohydrate databases, as well as carbohydrates contained in the Protein Data Bank, which renders the database currently the most comprehensive and unified resource for carbohydrate structures worldwide. Hyperlinks to the original source of the data are established, so users can use the GlycomeDB Web-portal to efficiently access relevant additional information that is available only in the original databases. GlycomeDB is a database that integrates knowledge from other existing databases; therefore, only carbohydrate structures that are stored in one of these databases will be integrated and cross-linked in GlycomeDB. Unfortunately, GlycomeDB cannot provide access to all published structures because, in contrast to proteomics and genomics, glycomics has not yet established a procedure that requires deposition of new structures in the context of publication. Therefore it can be assumed that not all published structures are currently available in a database. However, if a public database is used in the future for the systematic deposition of new structures, these structures should also become available automatically in GlycomeDB. In general, the quality of the data depends on the quality of the referenced databases and their curation processes. Nevertheless, GlycoUpdateDB applies additional validation checks during the integration process in order to improve the quality of the data. The curated database can be downloaded and used freely by interested scientists. It can be assumed that the development of annotation tools in MS and NMR that require a library of existing carbohydrate structures as reference data will benefit from the availability of GlycomeDB. Additionally, the data contained in GlycomeDB can facilitate statistical analyses of the ‘glycospace’ of different organisms (18,19).
GlycomeDB can be accessed using a Web-portal (http://www.glycome-db.org/) or the complete database can be downloaded as a compressed zip archive, containing all structures that have been integrated (http://www.glycome-db.org/downloads/). The structures are stored in regular XML files according to the Glyde II specification and can be used by any software that supports this format.
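A downloaded archive of XML files could be read along the following lines. This is a minimal sketch using only the Python standard library; the `iter_xml_documents` helper, the member names and the placeholder element are assumptions for illustration and do not reflect the actual archive layout or the Glyde II schema.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def iter_xml_documents(archive):
    """Yield (member name, parsed root element) for every .xml file in a zip."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                with zf.open(name) as handle:
                    yield name, ET.parse(handle).getroot()


# Demonstration on an in-memory archive so the sketch runs without a download.
# The file names and the <placeholder> element are made up for this example.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("structure_1.xml", "<placeholder id='1'/>")
    zf.writestr("README.txt", "not xml")
buffer.seek(0)

documents = [
    (name, root.tag, root.get("id")) for name, root in iter_xml_documents(buffer)
]
```

To process a real download, the path of the saved zip file would be passed to `iter_xml_documents` in place of the in-memory buffer, and the element handling would need to follow the Glyde II specification.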
EU (6th Research Framework Program, RIDS contract number 011952); German Research Foundation (DFG BIB 46 HDdkz 01-01). Funding for open access charge: German Cancer Research Center (DKFZ), Heidelberg, Germany.
Conflict of interest statement. None declared.
The authors wish to express their gratitude to the EU (6th Research Framework Program, RIDS). The authors wish to thank the maintainers of all public glycomics database projects for their cooperation and helpfulness. This project would be unthinkable without their support. We also would like to thank all our collaborators from the EUROCarbDB project.