The Chemistry Development Kit (CDK) is an open source Java library for manipulating and processing chemical information. A key aspect in handling chemical structures is the determination of the chemical rings. The rings of a structure are used areas including descriptors, stereochemistry, similarity, screening and atom typing. The CDK includes multiple algorithms for determining the rings of a structure on demand. Non-unique descriptions of rings were often used due to the slower performance of the unique alternatives.
Efficient algorithms for handling chemical ring perception have been implemented and optimised in the CDK. The algorithms provide much faster computation of new and existing types of rings. Several optimisation and implementation considerations are discussed which improve real case usage. The performance is measured on several publicly available data sets and in several cases the new implementations were found to be more than an order of magnitude faster.
Algorithmic improvements allow handling of much larger datasets in reasonable time. Faster computation allows more appropriate rings to be utilised in procedures such as aromaticity. Several areas that require ring perception have also seen a noticeable improvement. The time taken to compute the unique rings is now comparable allowing a correct usage throughout the toolkit. All source code is open source and freely available.
Rings; Cycles; CDK
The application of reporting standards in metabolomics allow data from different laboratories to be shared, integrated and interpreted. Although minimum reporting standards related to metabolite identification were published in 2007, it is clear that significant efforts are required to ensure their continuous update and appropriate use by the metabolomics community. These include their use in metabolomics data submission (e.g., MetaboLights) and as a requirement for publication in peer-reviewed journals (e.g., Metabolomics). The Data Standards and Metabolite Identification Task Groups of the international Metabolomics Society are actively working to develop and promote these standards and educate the community on their use.
Summary: The Web Ontology Language (OWL) provides a sophisticated language for building complex domain ontologies and is widely used in bio-ontologies such as the Gene Ontology. The Protégé-OWL ontology editing tool provides a query facility that allows composition and execution of queries with the human-readable Manchester OWL syntax, with syntax checking and entity label lookup. No equivalent query facility such as the Protégé Description Logics (DL) query yet exists in web form. However, many users interact with bio-ontologies such as chemical entities of biological interest and the Gene Ontology using their online Web sites, within which DL-based querying functionality is not available. To address this gap, we introduce the OntoQuery web-based query utility.
Availability and implementation: The source code for this implementation together with instructions for installation is available at http://github.com/IlincaTudose/OntoQuery. OntoQuery software is fully compatible with all OWL-based ontologies and is available for download (CC-0 license). The ChEBI installation, ChEBI OntoQuery, is available at http://www.ebi.ac.uk/chebi/tools/ontoquery.
Cheminformaticians have to routinely process and analyse libraries of small molecules. Among other things, that includes the standardization of molecules, calculation of various descriptors, visualisation of molecular structures, and downstream analysis. For this purpose, scientific workflow platforms such as the Konstanz Information Miner can be used if provided with the right plug-in. A workflow-based cheminformatics tool provides the advantage of ease-of-use and interoperability between complementary cheminformatics packages within the same framework, hence facilitating the analysis process.
KNIME-CDK comprises functions for molecule conversion to/from common formats, generation of signatures, fingerprints, and molecular properties. It is based on the Chemistry Development Toolkit and uses the Chemical Markup Language for persistence. A comparison with the cheminformatics plug-in RDKit shows that KNIME-CDK supports a similar range of chemical classes and adds new functionality to the framework. We describe the design and integration of the plug-in, and demonstrate the usage of the nodes on ChEBI, a library of small molecules of biological interest.
KNIME-CDK is an open-source plug-in for the Konstanz Information Miner, a free workflow platform. KNIME-CDK is build on top of the open-source Chemistry Development Toolkit and allows for efficient cross-vendor structural cheminformatics. Its ease-of-use and modularity enables researchers to automate routine tasks and data analysis, bringing complimentary cheminformatics functionality to the workflow environment.
Cheminformatics; Workflows; Data integration; Software library
The Gene Ontology (GO) facilitates the description of the action of gene products in a biological context. Many GO terms refer to chemical entities that participate in biological processes. To facilitate accurate and consistent systems-wide biological representation, it is necessary to integrate the chemical view of these entities with the biological view of GO functions and processes. We describe a collaborative effort between the GO and the Chemical Entities of Biological Interest (ChEBI) ontology developers to ensure that the representation of chemicals in the GO is both internally consistent and in alignment with the chemical expertise captured in ChEBI.
We have examined and integrated the ChEBI structural hierarchy into the GO resource through computationally-assisted manual curation of both GO and ChEBI. Our work has resulted in the creation of computable definitions of GO terms that contain fully defined semantic relationships to corresponding chemical terms in ChEBI.
The set of logical definitions using both the GO and ChEBI has already been used to automate aspects of GO development and has the potential to allow the integration of data across the domains of biology and chemistry. These logical definitions are available as an extended version of the ontology from http://purl.obolibrary.org/obo/go/extensions/go-plus.owl.
Summary: Genome-scale metabolic models often lack annotations that would allow them to be used for further analysis. Previous efforts have focused on associating metabolites in the model with a cross reference, but this can be problematic if the reference is not freely available, multiple resources are used or the metabolite is added from a literature review. Associating each metabolite with chemical structure provides unambiguous identification of the components and a more detailed view of the metabolism. We have developed an open-source desktop application that simplifies the process of adding database cross references and chemical structures to genome-scale metabolic models. Annotated models can be exported to the Systems Biology Markup Language open interchange format.
Availability: Source code, binaries, documentation and tutorials are freely available at http://johnmay.github.com/metingear. The application is implemented in Java with bundles available for MS Windows and Macintosh OS X.
Supplementary data are available at Bioinformatics online.
With ever-increasing amounts of metabolomics data produced each year, there is an even greater need to disseminate data and knowledge produced in a standard and reproducible way. To assist with this a general purpose, open source metabolomics repository, MetaboLights, was launched in 2012. To promote a community standard, initially culminated as metabolomics standards initiative (MSI), COordination of Standards in MetabOlomicS (COSMOS) was introduced. COSMOS aims to link life science e-infrastructures within the worldwide metabolomics community as well as develop and maintain open source exchange formats for raw and processed data, ensuring better flow of metabolomics information.
Metabolomics; MetaboLights; Database; Repository; Data sharing; Standard; MSI; ISA-Tab; Curation; COSMOS
Protein sequence databases are the pillar upon which modern proteomics is supported, representing a stable reference space of predicted and validated proteins. One example of such resources is UniProt, enriched with both expertly curated and automatic annotations. Taken largely for granted, similar mature resources such as UniProt are not available yet in some other “omics” fields, lipidomics being one of them. While having a seasoned community of wet lab scientists, lipidomics lies significantly behind proteomics in the adoption of data standards and other core bioinformatics concepts. This work aims to reduce the gap by developing an equivalent resource to UniProt called ‘LipidHome’, providing theoretically generated lipid molecules and useful metadata. Using the ‘FASTLipid’ Java library, a database was populated with theoretical lipids, generated from a set of community agreed upon chemical bounds. In parallel, a web application was developed to present the information and provide computational access via a web service. Designed specifically to accommodate high throughput mass spectrometry based approaches, lipids are organised into a hierarchy that reflects the variety in the structural resolution of lipid identifications. Additionally, cross-references to other lipid related resources and papers that cite specific lipids were used to annotate lipid records. The web application encompasses a browser for viewing lipid records and a ‘tools’ section where an MS1 search engine is currently implemented. LipidHome can be accessed at http://www.ebi.ac.uk/apweiler-srv/lipidhome.
MetaboLights is the first general-purpose open-access curated repository for metabolomic studies, their raw experimental data and associated metadata, maintained by one of the major open-access data providers in molecular biology. Increases in the number of depositions, number of samples per study and the file size of data submitted to MetaboLights present a challenge for the objective of ensuring high-quality and standardized data in the context of diverse metabolomic workflows and data representations. Here, we describe the MetaboLights curation pipeline, its challenges and its practical application in quality control of complex data depositions.
User-centred design (UCD) is a type of user interface design in which the needs and desires of users are taken into account at each stage of the design process for a service or product; often for software applications and websites. Its goal is to facilitate the design of software that is both useful and easy to use. To achieve this, you must characterise users’ requirements, design suitable interactions to meet their needs, and test your designs using prototypes and real life scenarios.
For bioinformatics, there is little practical information available regarding how to carry out UCD in practice. To address this we describe a complete, multi-stage UCD process used for creating a new bioinformatics resource for integrating enzyme information, called the Enzyme Portal (http://www.ebi.ac.uk/enzymeportal). This freely-available service mines and displays data about proteins with enzymatic activity from public repositories via a single search, and includes biochemical reactions, biological pathways, small molecule chemistry, disease information, 3D protein structures and relevant scientific literature.
We employed several UCD techniques, including: persona development, interviews, ‘canvas sort’ card sorting, user workflows, usability testing and others. Our hope is that this case study will motivate the reader to apply similar UCD approaches to their own software design for bioinformatics. Indeed, we found the benefits included more effective decision-making for design ideas and technologies; enhanced team-working and communication; cost effectiveness; and ultimately a service that more closely meets the needs of our target audience.
3D protein structure; Biological pathways; Card sorting; Design; Enzyme; Enzyme portal; Implementation; Personae; Prototyping; User-centered design (USA spelling); User-centred design; User experience; User profiles; User requirements; Usability testing
ChEBI (http://www.ebi.ac.uk/chebi) is a database and ontology of chemical entities of biological interest. Over the past few years, ChEBI has continued to grow steadily in content, and has added several new features. In addition to incorporating all user-requested compounds, our annotation efforts have emphasized immunology, natural products and metabolites in many species. All database entries are now ‘is_a’ classified within the ontology, meaning that all of the chemicals are available to semantic reasoning tools that harness the classification hierarchy. We have completely aligned the ontology with the Open Biomedical Ontologies (OBO) Foundry-recommended upper level Basic Formal Ontology. Furthermore, we have aligned our chemical classification with the classification of chemical-involving processes in the Gene Ontology (GO), and as a result of this effort, the majority of chemical-involving processes in GO are now defined in terms of the ChEBI entities that participate in them. This effort necessitated incorporating many additional biologically relevant compounds. We have incorporated additional data types including reference citations, and the species and component for metabolites. Finally, our website and web services have had several enhancements, most notably the provision of a dynamic new interactive graph-based ontology visualization.
The availability of comprehensive information about enzymes plays an important role in answering questions relevant to interdisciplinary fields such as biochemistry, enzymology, biofuels, bioengineering and drug discovery. At the EMBL European Bioinformatics Institute, we have developed an enzyme portal (http://www.ebi.ac.uk/enzymeportal) to provide this wealth of information on enzymes from multiple in-house resources addressing particular data classes: protein sequence and structure, reactions, pathways and small molecules. The fact that these data reside in separate databases makes information discovery cumbersome. The main goal of the portal is to simplify this process for end users.
MetaboLights (http://www.ebi.ac.uk/metabolights) is the first general-purpose, open-access repository for metabolomics studies, their raw experimental data and associated metadata, maintained by one of the major open-access data providers in molecular biology. Metabolomic profiling is an important tool for research into biological functioning and into the systemic perturbations caused by diseases, diet and the environment. The effectiveness of such methods depends on the availability of public open data across a broad range of experimental methods and conditions. The MetaboLights repository, powered by the open source ISA framework, is cross-species and cross-technique. It will cover metabolite structures and their reference spectra as well as their biological roles, locations, concentrations and raw data from metabolic experiments. Studies automatically receive a stable unique accession number that can be used as a publication reference (e.g. MTBLS1). At present, the repository includes 15 submitted studies, encompassing 93 protocols for 714 assays, and span over 8 different species including human, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana. Eight hundred twenty-seven of the metabolites identified in these studies have been mapped to ChEBI. These studies cover a variety of techniques, including NMR spectroscopy and mass spectrometry.
Exciting funding initiatives are emerging in Europe and the US for metabolomics data production, storage, dissemination and analysis. This is based on a rich ecosystem of resources around the world, which has been build during the past ten years, including but not limited to resources such as MassBank in Japan and the Human Metabolome Database in Canada. Now, the European Bioinformatics Institute has launched MetaboLights, a database for metabolomics experiments and the associated metadata (http://www.ebi.ac.uk/metabolights). It is the first comprehensive, cross-species, cross-platform metabolomics database maintained by one of the major open access data providers in molecular biology. In October, the European COSMOS consortium will start its work on Metabolomics data standardization, publication and dissemination workflows. The NIH in the US is establishing 6–8 metabolomics services cores as well as a national metabolomics repository. This communication reports about MetaboLights as a new resource for Metabolomics research, summarises the related developments and outlines how they may consolidate the knowledge management in this third large omics field next to proteomics and genomics.
Metabolomics; Databases; ISA-Tab; ISA commons
Designers have a saying that “the joy of an early release lasts but a short time. The bitterness of an unusable system lasts for years.” It is indeed disappointing to discover that your data resources are not being used to their full potential. Not only have you invested your time, effort, and research grant on the project, but you may face costly redesigns if you want to improve the system later. This scenario would be less likely if the product was designed to provide users with exactly what they need, so that it is fit for purpose before its launch. We work at EMBL-European Bioinformatics Institute (EMBL-EBI), and we consult extensively with life science researchers to find out what they need from biological data resources. We have found that although users believe that the bioinformatics community is providing accurate and valuable data, they often find the interfaces to these resources tricky to use and navigate. We believe that if you can find out what your users want even before you create the first mock-up of a system, the final product will provide a better user experience. This would encourage more people to use the resource and they would have greater access to the data, which could ultimately lead to more scientific discoveries. In this paper, we explore the need for a user-centred design (UCD) strategy when designing bioinformatics resources and illustrate this with examples from our work at EMBL-EBI. Our aim is to introduce the reader to how selected UCD techniques may be successfully applied to software design for bioinformatics.
Natural product-likeness of a molecule, i.e. similarity of this molecule to the structure space covered by natural products, is a useful criterion in screening compound libraries and in designing new lead compounds. A closed source implementation of a natural product-likeness score, that finds its application in virtual screening, library design and compound selection, has been previously reported by one of us. In this note, we report an open-source and open-data re-implementation of this scoring system, illustrate its efficiency in ranking small molecules for natural product likeness and discuss its potential applications.
The Natural-Product-Likeness scoring system is implemented as Taverna 2.2 workflows, and is available under Creative Commons Attribution-Share Alike 3.0 Unported License at http://www.myexperiment.org/packs/183.html. It is also available for download as executable standalone java package from http://sourceforge.net/projects/np-likeness/under Academic Free License.
Our open-source, open-data Natural-Product-Likeness scoring system can be used as a filter for metabolites in Computer Assisted Structure Elucidation or to select natural-product-like molecules from molecular libraries for the use as leads in drug discovery.
Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is 'pentacyclic compound' (compounds containing five-ring structures), while an example of a role-based class is 'analgesic', since many different chemicals can act as analgesics without sharing structural features. Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain. As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies.
We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches.
Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational utilities including algorithmic, statistical and logic-based tools. For the task of automatic structure-based classification of chemical entities, essential to managing the vast swathes of chemical data being brought online, systems which are capable of hybrid reasoning combining several different approaches are crucial. We provide a thorough review of the available tools and methodologies, and identify areas of open research.
The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publically release such data has advanced biochemical structure-activity relationship research, while providing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g. ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, currently existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent chemical class definition axiomatization, facile class redefinition, or novel class integration, thus further limiting chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need for the automation of this process, especially for novel chemical entities of biological interest.
To address this, we present a formal framework based on Semantic Web technologies for the automatic design of chemical ontology which can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology based on 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development.
We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development.
The computational processing and analysis of small molecules is at heart of cheminformatics and structural bioinformatics and their application in e.g. metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego™-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it into a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.
The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker's architecture (version 2.0). 64-bit computing and multi-core usage by paralleled threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies like workarounds for iterative data reading are removed. The combinatorial chemistry related reaction enumeration features are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.
CDK-Taverna 2.0 as an open-source cheminformatics workflow solution matured to become a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in e.g. today's systems biology scenarios.
Rhea (http://www.ebi.ac.uk/rhea) is a comprehensive resource of expert-curated biochemical reactions. Rhea provides a non-redundant set of chemical transformations for use in a broad spectrum of applications, including metabolic network reconstruction and pathway inference. Rhea includes enzyme-catalyzed reactions (covering the IUBMB Enzyme Nomenclature list), transport reactions and spontaneously occurring reactions. Rhea reactions are described using chemical species from the Chemical Entities of Biological Interest ontology (ChEBI) and are stoichiometrically balanced for mass and charge. They are extensively manually curated with links to source literature and other public resources on metabolism including enzyme and pathway databases. This cross-referencing facilitates the mapping and reconciliation of common reactions and compounds between distinct resources, which is a common first step in the reconstruction of genome scale metabolic networks and models.
The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.
This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.