Untargeted metabolomics generates a huge amount of data. Software packages for automated data processing are crucial to successfully process these data. A variety of such software packages exist, but the outcome of data processing strongly depends on algorithm parameter settings. Parameter settings that are not carefully chosen can easily lead to biased results and therefore require optimization. Several parameter optimization approaches have already been proposed, but a software package for parameter optimization which is free of intricate experimental labeling steps, fast and widely applicable is still missing.
We implemented the software package IPO (‘Isotopologue Parameter Optimization’), which is fast, free of labeling steps, and applicable to data from different kinds of samples, different methods of liquid chromatography - high resolution mass spectrometry and different instruments.
IPO optimizes XCMS peak picking parameters by using natural, stable 13C isotopic peaks to calculate a peak picking score. Retention time correction is optimized by minimizing relative retention time differences within peak groups. Grouping parameters are optimized by maximizing the number of peak groups that show one peak from each injection of a pooled sample. The different parameter settings are achieved by design of experiments, and the resulting scores are evaluated using response surface models. IPO was tested on three different data sets, each consisting of a training set and test set. IPO resulted in an increase of reliable groups (146% - 361%), a decrease of non-reliable groups (3% - 8%) and a decrease of the retention time deviation to one third.
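The idea of using natural 13C isotopic peaks as a reliability signal can be illustrated with a minimal sketch. This is not IPO's actual scoring function: the function name, the tolerances and the example peaks are hypothetical, and the sketch assumes singly charged ions and a 13C peak that is less intense than its monoisotopic partner. The only fixed quantity is the 13C/12C mass difference of about 1.003355 Da.

```python
C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C in Da


def count_isotopologue_pairs(peaks, ppm=10.0, rt_tol=2.0):
    """Count peaks that have a candidate 13C partner.

    peaks: list of (mz, rt, intensity) tuples for singly charged ions.
    A partner must lie one 13C mass step higher (within `ppm`), elute
    within `rt_tol` seconds, and be less intense than the lighter peak.
    All thresholds are illustrative assumptions.
    """
    paired = 0
    for mz, rt, intensity in peaks:
        target = mz + C13_C12_DELTA
        tol = target * ppm * 1e-6
        for mz2, rt2, intensity2 in peaks:
            if (abs(mz2 - target) <= tol
                    and abs(rt2 - rt) <= rt_tol
                    and intensity2 < intensity):
                paired += 1
                break  # one partner is enough for this peak
    return paired


# Hypothetical peak list: one isotopologue pair and one unpaired peak.
example_peaks = [
    (180.0634, 120.0, 1.0e6),
    (181.0668, 120.3, 6.5e4),
    (250.1000, 300.0, 5.0e5),
]
print(count_isotopologue_pairs(example_peaks))
```

A parameter set that yields more such pairs relative to unpaired peaks would, under these assumptions, be considered more reliable.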
IPO was successfully applied to data derived from liquid chromatography coupled to high resolution mass spectrometry from three studies with different sample types and different chromatographic methods and devices. We were also able to show the potential of IPO to increase the reliability of metabolomics data.
The source code is implemented in R, tested on Linux and Windows and it is freely available for download at https://github.com/glibiseller/IPO. The training sets and test sets can be downloaded from https://health.joanneum.at/IPO.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0562-8) contains supplementary material, which is available to authorized users.
Metabolomics; XCMS; Parameter optimization; Design of experiments; Isotopologue
Ontology-based enrichment analysis aids in the interpretation and understanding of large-scale biological data. Ontologies are hierarchies of biologically relevant groupings. Using ontology annotations, which link ontology classes to biological entities, enrichment analysis methods assess whether there is a significant over- or under-representation of entities for ontology classes. While many tools exist that run enrichment analysis for protein sets annotated with the Gene Ontology, there are only a few that can be used for small-molecule enrichment analysis.
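Over-representation of an ontology class is commonly assessed with a hypergeometric test. The following sketch shows that general technique and is not taken from any particular tool; the function and argument names are illustrative assumptions.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value for over-representation.

    population: number of entities in the background set
    annotated:  background entities annotated with the ontology class
    sample:     number of entities in the set under study
    hits:       entities in the study set annotated with the class
    Returns P(X >= hits) when drawing `sample` entities at random.
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical numbers: 5 of 10 background molecules carry the class,
# and all 5 molecules in the study set are annotated with it.
print(enrichment_p_value(10, 5, 5, 5))
```

In practice the resulting p-values would also need correction for multiple testing across ontology classes.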
We describe BiNChE, an enrichment analysis tool for small molecules based on the ChEBI Ontology. BiNChE displays an interactive graph that can be exported as a high-resolution image or in network formats. The tool provides plain, weighted and fragment analysis based on either the ChEBI Role Ontology or the ChEBI Structural Ontology.
BiNChE aids in the exploration of large sets of small molecules produced within Metabolomics or other Systems Biology research contexts. The open-source tool provides easy and highly interactive web access to enrichment analysis with the ChEBI ontology tool and is additionally available as a standalone library.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0486-3) contains supplementary material, which is available to authorized users.
Ontology; Enrichment; Small molecules
Mass spectrometry is an important analytical technology in metabolomics. After the initial feature detection and alignment steps, the raw data processing results in a high-dimensional data matrix of mass spectral features, which is then subjected to further statistical analysis. Univariate tests like Student’s t-test and analysis of variance (ANOVA) are hypothesis tests, which aim to detect differences between two or more sample classes, e.g., wild-type and mutant, or different doses of a treatment. In both cases, one of the underlying assumptions is the independence between metabolic features. However, in mass spectrometry, a single metabolite usually gives rise to several mass spectral features, which are observed together and show a common behavior. This paper proposes grouping the related features of metabolites with CAMERA into compound spectra, and then using a multivariate statistical method to test whether a compound spectrum (and thus the actual metabolite) is differential between two sample classes. The multivariate method is first demonstrated with an analysis between wild-type and an over-expression line of the model plant Arabidopsis thaliana. For a quantitative evaluation, data sets with a simulated known effect between two sample classes were analyzed. The spectra-wise analysis showed better detection results for all simulated effects.
metabolomics; statistics; hypothesis tests; multivariate analysis; mass spectrometry
The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R.
We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail. These range from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to make MS-based proteomics and metabolomics results available to a wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adopted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.
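Because mzTab is a tab-separated text format in which each line begins with a short code for its section (for example MTD for metadata, SMH for the small molecule header and SML for small molecule rows), it can be read with very little code. The sketch below is a simplified illustration under that assumption, not a validating parser: the function name is hypothetical, and the example file is heavily abbreviated, with a made-up identifier and far fewer columns than the specification requires.

```python
def read_mztab_tables(text):
    """Group mzTab rows by line prefix and attach column names.

    Returns (metadata, tables), where metadata maps MTD keys to values
    and tables maps a row prefix (e.g. "SML") to a list of row dicts.
    Comment and unknown lines are ignored.
    """
    header_for = {"PRT": "PRH", "PEP": "PEH", "PSM": "PSH", "SML": "SMH"}
    headers, tables, metadata = {}, {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        prefix, *fields = line.split("\t")
        if prefix == "MTD" and len(fields) >= 2:
            metadata[fields[0]] = fields[1]
        elif prefix in header_for.values():
            headers[prefix] = fields
        elif prefix in header_for:
            columns = headers.get(header_for[prefix], [])
            tables.setdefault(prefix, []).append(dict(zip(columns, fields)))
    return metadata, tables


# Abbreviated, hypothetical example content (not a complete mzTab file).
example = (
    "MTD\tmzTab-version\t1.0.0\n"
    "SMH\tidentifier\tchemical_formula\n"
    "SML\tEXAMPLE:0001\tC6H12O6\n"
)
metadata, tables = read_mztab_tables(example)
print(metadata)
print(tables["SML"])
```

The same flat structure is what makes the format easy to open in spreadsheet software or to load into R.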
The ISA-Tab format and software suite have been developed to break the silo effect induced by technology-specific formats for a variety of data types and to better support experimental metadata tracking. Experimentalists seldom use a single technique to monitor biological signals. Providing a multi-purpose, pragmatic and accessible format that abstracts away common constructs for describing Investigations, Studies and Assays, ISA is increasingly popular. To attract further interest towards the format and extend support to ensure reproducible research and reusable data, we present the Risa package, which delivers a central component to support the ISA format by enabling effortless integration with R, the popular, open source data crunching environment.
The Risa package bridges the gap between the metadata collection and curation in an ISA-compliant way and the data analysis using the widely used statistical computing environment R. The package offers functionality for: i) parsing ISA-Tab datasets into R objects; ii) augmenting annotation with extra metadata not explicitly stated in the ISA syntax; iii) interfacing with domain-specific R packages; iv) suggesting potentially useful R packages available in Bioconductor for subsequent processing of the experimental data described in the ISA format; and finally v) saving back to ISA-Tab files augmented with analysis-specific metadata from R. We demonstrate these features by presenting use cases for mass spectrometry data and DNA microarray data.
The Risa package is open source (with LGPL license) and freely available through Bioconductor. By making Risa available, we aim to facilitate the task of processing experimental data, encouraging a uniform representation of experimental information and results while delivering tools for ensuring traceability and provenance tracking.
The Risa package has been available since Bioconductor 2.11 (version 1.0.0), and version 1.2.1 appeared in Bioconductor 2.12, both along with documentation and examples. The latest version of the code is in the development branch in Bioconductor and can also be accessed from GitHub at https://github.com/ISA-tools/Risa, where the issue tracker allows users to report bugs or feature requests.
Leukocytoclastic vasculitis is a multicausal systemic inflammatory disease of the small vessels, histologically characterized by inflammation and deposition of both nuclear debris and fibrin in dermal postcapillary venules. The clinical picture typically involves palpable purpura of the lower legs and may be associated with general symptoms such as fatigue, arthralgia and fever. Involvement of the internal organs, most notably the kidneys, the central nervous system or the eyes, is possible and determines the prognosis. Oxaliplatin-induced leukocytoclastic vasculitis is a very rare event that limits treatment options in affected patients. We report 2 patients who developed the condition under chemotherapy for advanced rectal and metastatic colon carcinoma, respectively; a termination of the therapy was therefore necessary. While current therapies for colorectal cancer include the combination of multimodal treatment with new and targeted agents, rare and unusual side effects elicited by established agents also need to be taken into account for the clinical management.
Leukocytoclastic vasculitis; Oxaliplatin; Colorectal cancer; Chemotherapy-associated toxicity; Glomerulonephritis
The task in the critical assessment of small molecule identification (CASMI) contest category 2 was to determine the identity of (initially) unknown compounds for which high-resolution tandem mass spectra were published. We focused on computer-assisted methods that tried to correctly identify the compound automatically and entered the contest with MetFrag and MetFusion to score candidate structures retrieved from the PubChem structure database. MetFrag was combined with the metabolite-likeness score, which helped to improve the performance for the natural product challenges. We present the results, discuss the performance, and give details of how to interpret the MetFrag and MetFusion output.
mass spectrometry; metabolite identification; MetFrag; MetFusion; metabolite likeness; molecular formula
The Critical Assessment of Small Molecule Identification, or CASMI, contest was founded in 2012 to provide scientists with a common open dataset to evaluate their identification methods. In this article, the challenges and solutions for the inaugural CASMI 2012 are presented. The contest was split into four categories corresponding with tasks to determine molecular formula and molecular structure, each from two measurement types, liquid chromatography-high resolution mass spectrometry (LC-HRMS), where preference was given to high mass accuracy data, and gas chromatography-electron impact-mass spectrometry (GC-MS), i.e., unit accuracy data. These challenges were obtained from plant material, environmental samples and reference standards. It was surprisingly difficult to obtain data suitable for a contest, especially for GC-MS data where existing databases are very large. The level of difficulty of the challenges is thus quite varied. In this article, the challenges and the answers are discussed, and recommendations for challenge selection in subsequent CASMI contests are given.
mass spectrometry; metabolite identification; small molecule identification; contest; metabolomics; non-target identification
The Critical Assessment of Small Molecule Identification (CASMI) Contest was founded in 2012 to provide scientists with a common open dataset to evaluate their identification methods. In this review, we summarize the submissions, evaluate procedures and discuss the results. We received five submissions (three external, two internal) for LC–MS Category 1 (best molecular formula) and six submissions (three external, three internal) for LC–MS Category 2 (best molecular structure). No external submissions were received for the GC–MS Categories 3 and 4. The team of Dunn et al. from Birmingham had the most first-place answers for Category 1, while Category 2 was won by H. Oberacher. Despite the low number of participants, the external and internal submissions cover a broad range of identification strategies, including expert knowledge, database searching, automated methods and structure generation. The results of Category 1 show that complementing automated strategies with (manual) expert knowledge was the most successful approach, while no automated method could compete with the power of spectral searching for Category 2, provided the challenge was present in a spectral library. Every participant topped at least one challenge, showing that different approaches are still necessary for interpretation diversity.
mass spectrometry; metabolite identification; small molecule identification; contest; metabolomics; non-target identification; unknown identification
Liquid chromatography coupled to mass spectrometry is routinely used for metabolomics experiments. In contrast to the fairly routine and automated data acquisition steps, subsequent compound annotation and identification require extensive manual analysis and thus form a major bottleneck in data interpretation. Here we present CAMERA, a Bioconductor package integrating algorithms to extract compound spectra, annotate isotope and adduct peaks, and propose the accurate compound mass even in highly complex data. To evaluate the algorithms, we compared the annotation of CAMERA against a manually defined annotation for a mixture of known compounds spiked into a complex matrix at different concentrations. CAMERA successfully extracted accurate masses for 89.7% and 90.3% of the annotatable compounds in positive and negative ion mode, respectively. Furthermore, we present a novel annotation approach that combines spectral information of data acquired in opposite ion modes to further improve the annotation rate. We demonstrate the utility of CAMERA in two different, easily adoptable plant metabolomics experiments, where the application of CAMERA drastically reduced the amount of manual analysis.
Mass-spectrometry-based proteomics has become an important component of biological research. Numerous proteomics methods have been developed to identify and quantify the proteins in biological and clinical samples [1], identify pathways affected by endogenous and exogenous perturbations [2], and characterize protein complexes [3]. Despite successes, the interpretation of vast proteomics datasets remains a challenge. There have been several calls for improvements and standardization of proteomics data analysis frameworks, as well as for an application-programming interface for proteomics data access [4,5]. In response, we have developed the ProteoWizard Toolkit, a robust set of open-source software libraries and applications designed to facilitate proteomics research. The libraries implement the first-ever, non-commercial, unified data access interface for proteomics, bridging field-standard open formats and all common vendor formats. In addition, diverse software classes enable rapid development of vendor-agnostic proteomics software. Additionally, ProteoWizard projects and applications, building upon the core libraries, are becoming standard tools for enabling significant proteomics inquiries.
MetaboLights (http://www.ebi.ac.uk/metabolights) is the first general-purpose, open-access repository for metabolomics studies, their raw experimental data and associated metadata, maintained by one of the major open-access data providers in molecular biology. Metabolomic profiling is an important tool for research into biological functioning and into the systemic perturbations caused by diseases, diet and the environment. The effectiveness of such methods depends on the availability of public open data across a broad range of experimental methods and conditions. The MetaboLights repository, powered by the open source ISA framework, is cross-species and cross-technique. It will cover metabolite structures and their reference spectra as well as their biological roles, locations, concentrations and raw data from metabolic experiments. Studies automatically receive a stable unique accession number that can be used as a publication reference (e.g. MTBLS1). At present, the repository includes 15 submitted studies, encompassing 93 protocols for 714 assays and spanning 8 different species including human, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana. Eight hundred twenty-seven of the metabolites identified in these studies have been mapped to ChEBI. These studies cover a variety of techniques, including NMR spectroscopy and mass spectrometry.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
Targeted proteomics via selected reaction monitoring is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than other shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by a lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via selected reaction monitoring.
This report summarizes the proceedings of the second workshop of the ‘Minimum Information for Biological and Biomedical Investigations’ (MIBBI) consortium held on Dec 1-2, 2010 in Rüdesheim, Germany through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
Summary: The first open source software suite for experimentalists and curators that (i) assists in the annotation and local management of experimental metadata from high-throughput studies employing one or a combination of omics and other technologies; (ii) empowers users to uptake community-defined checklists and ontologies; and (iii) facilitates submission to international public repositories.
Availability and Implementation: Software, documentation, case studies and implementations at http://www.isa-tools.org
Mass spectrometry has become the analytical method of choice in metabolomics research. The identification of unknown compounds is the main bottleneck. In addition to the precursor mass, tandem MS spectra carry informative fragment peaks, but spectral libraries of measured reference compounds are far from covering the complete chemical space. Compound libraries such as PubChem or KEGG describe a larger number of compounds, whose in silico fragmentation can be compared with spectra of unknown metabolites.
We created the MetFrag suite to obtain a candidate list from compound libraries based on the precursor mass, subsequently ranked by the agreement between measured and in silico fragments. In the evaluation, MetFrag was able to rank most of the correct compounds within the top 3 candidates returned by an exact mass query in KEGG. Compared to a previously published study, MetFrag obtained better results than the commercial MassFrontier software. Especially for large compound libraries, the candidates with a good score show a high structural similarity or differ only in stereochemistry; a subsequent clustering based on chemical distances reduces this redundancy. The in silico fragmentation requires less than a second to process a molecule, and MetFrag performs a search in KEGG or PubChem on average within 30 or 300 seconds, respectively, on an average desktop PC.
We presented a method that is able to identify small molecules from tandem MS measurements, even without spectral reference data or a large set of fragmentation rules. With today's massive general purpose compound libraries we obtain dozens of very similar candidates, which still allows a confident estimate of the correct compound class. Our tool MetFrag improves the identification of unknown substances from tandem MS spectra and delivers better results than comparable commercial software. MetFrag is available through a web application, web services and as a Java library. The web frontend allows the end-user to analyse single spectra and browse the results, whereas the web service and console application are aimed at performing batch searches and evaluation.
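The two-step workflow described here, candidate retrieval by precursor mass followed by ranking on fragment agreement, can be sketched as follows. This is a simplified illustration and not MetFrag's scoring: the function names are hypothetical, the in silico fragment masses are assumed to be given, and the score is simply the fraction of measured peaks matched by any predicted fragment.

```python
def within_ppm(observed, theoretical, ppm):
    """True if observed lies within a ppm window around theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6


def rank_candidates(precursor_mass, measured_peaks, library, ppm=10.0):
    """Filter a compound library by mass, then rank by fragment matches.

    library: list of (name, exact_mass, predicted_fragment_masses).
    Returns (name, score) pairs sorted by descending score, where score
    is the fraction of measured peaks explained by a predicted fragment.
    """
    ranked = []
    for name, mass, fragments in library:
        if not within_ppm(precursor_mass, mass, ppm):
            continue  # candidate does not match the precursor mass
        explained = sum(
            any(within_ppm(peak, frag, ppm) for frag in fragments)
            for peak in measured_peaks
        )
        score = explained / len(measured_peaks) if measured_peaks else 0.0
        ranked.append((name, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical library entries with made-up masses.
example_library = [
    ("candidate_A", 200.0000, [100.0, 150.0]),
    ("candidate_B", 200.0010, [100.0]),
    ("candidate_C", 300.0000, [100.0]),
]
print(rank_candidates(200.0005, [100.0, 150.0], example_library))
```

Candidates that differ only in stereochemistry would receive identical scores in such a scheme, which is why a subsequent clustering step is useful.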
Liquid chromatography coupled to mass spectrometry (LC/MS) is an important analytical technology for, e.g., metabolomics experiments. Determining the boundaries, centres and intensities of the two-dimensional signals in the LC/MS raw data is called feature detection. A reliable feature detection is mandatory for the subsequent analysis of complex samples such as plant extracts, which may contain hundreds of compounds corresponding to thousands of features.
We developed a new feature detection algorithm centWave for high-resolution LC/MS data sets, which collects regions of interest (partial mass traces) in the raw-data, and applies continuous wavelet transformation and optionally Gauss-fitting in the chromatographic domain. We evaluated our feature detection algorithm on dilution series and mixtures of seed and leaf extracts, and estimated recall, precision and F-score of seed and leaf specific features in two experiments of different complexity.
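Recall, precision and F-score follow their standard definitions, which a short sketch makes explicit. The function name and the counts in the example are illustrative and not taken from the evaluation; the F-score shown is the balanced F1 variant.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from detection counts.

    true_pos:  detected features that are real
    false_pos: detected features that are not real
    false_neg: real features that were missed
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 80 correct detections, 20 spurious, 20 missed.
print(precision_recall_f(80, 20, 20))
```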
The new feature detection algorithm meets the requirements of current metabolomics experiments. centWave can detect close-by and partially overlapping features and has the highest overall recall and precision values compared to the other algorithms, matchedFilter (the original algorithm of XCMS) and the centroidPicker from MZmine. The centWave algorithm was integrated into the Bioconductor R-package XCMS and is available from
Current efforts in Metabolomics, such as the Human Metabolome Project, collect structures of biological metabolites as well as data for their characterisation, such as spectra for identification of substances and measurements of their concentration. Still, only a fraction of existing metabolites and their spectral fingerprints are known. Computer-Assisted Structure Elucidation (CASE) of biological metabolites will be an important tool to address this lack of knowledge. Indispensable for CASE are modules to predict spectra for hypothetical structures. This paper evaluates different statistical and machine learning methods to perform predictions of proton NMR spectra based on data from our open database NMRShiftDB.
A mean absolute error of 0.18 ppm was achieved for the prediction of proton NMR shifts ranging from 0 to 11 ppm. Random forest, J48 decision tree and support vector machines achieved similar overall errors. HOSE codes, a notably simple method, achieved a comparatively good result of 0.17 ppm mean absolute error.
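The mean absolute error reported here is the average of the absolute differences between predicted and measured shifts. A minimal sketch with made-up shift values (not data from NMRShiftDB) shows the calculation; the function name is illustrative.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between paired values (in ppm here)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not predicted:
        raise ValueError("at least one value pair is required")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical proton shifts in ppm.
predicted_shifts = [1.0, 2.0, 7.2]
observed_shifts = [1.1, 2.0, 7.0]
print(mean_absolute_error(predicted_shifts, observed_shifts))
```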
NMR prediction methods applied in the course of this work delivered precise predictions which can serve as a building block for Computer-Assisted Structure Elucidation for biological metabolites.
Liquid chromatography coupled to mass spectrometry (LC-MS) has become a prominent tool for the analysis of complex proteomics and metabolomics samples. In many applications multiple LC-MS measurements need to be compared, e.g. to improve reliability or to combine results from different samples in a statistical comparative analysis. As in all physical experiments, LC-MS data are affected by uncertainties, and variability of retention time is encountered in all data sets. It is therefore necessary to estimate and correct the underlying distortions of the retention time axis to search for corresponding compounds in different samples. To this end, a variety of so-called LC-MS map alignment algorithms have been developed during the last four years. Most of these approaches are well documented, but they are usually evaluated on very specific samples only. So far, no publication has assessed different alignment algorithms using a standard LC-MS sample along with commonly used quality criteria.
We propose two LC-MS proteomics as well as two LC-MS metabolomics data sets that represent typical alignment scenarios. Furthermore, we introduce a new quality measure for the evaluation of LC-MS alignment algorithms. Using the four data sets to compare six freely available alignment algorithms proposed for the alignment of metabolomics and proteomics LC-MS measurements, we found significant differences with respect to alignment quality, running time, and usability in general.
The multitude of available alignment methods necessitates the generation of standard data sets and quality measures that allow users as well as developers to benchmark and compare their map alignment tools on a fair basis. Our study represents a first step in this direction. Currently, the installation and evaluation of the "correct" parameter settings can be quite a time-consuming task, and the success of a particular method is still highly dependent on the experience of the user. Therefore, we propose to continue and extend this type of study to a community-wide competition. All data as well as our evaluation scripts are available at .