|Home | About | Journals | Submit | Contact Us | Français|
Motivation: The study of metabolites (metabolomics) is increasingly being applied to investigate microbial, plant, environmental and mammalian systems. One of the limiting factors is that of chemically identifying metabolites from mass spectrometric signals present in complex datasets.
Results: Three workflows have been developed to allow for the rapid, automated and high-throughput annotation and putative metabolite identification of electrospray LC-MS-derived metabolomic datasets. The collection of workflows are defined as PUTMEDID_LCMS and perform feature annotation, matching of accurate m/z to the accurate mass of neutral molecules and associated molecular formula and matching of the molecular formulae to a reference file of metabolites. The software is independent of the instrument and data pre-processing applied. The number of false positives is reduced by eliminating the inaccurate matching of many artifact, isotope, multiply charged and complex adduct peaks through complex interrogation of experimental data.
Availability: The workflows, standard operating procedure and further information are publicly available at http://www.mcisb.org/resources/putmedid.html.
Systems biology is applied to study the components of, and more importantly their complex interactions in, biological systems. One set of components which are studied in systems biology investigations are metabolites, either by targeted or holistic profiling experimental strategies (Dunn et al., 2011). Low molecular weight inorganic and organic metabolites play important roles in the operation and maintenance of biological systems. The study of metabolites (metabolomics) is increasingly being applied to investigate microbial (Bradley et al., 2009; MacKenzie et al., 2008; Mashego et al., 2007), plant (Allwood et al., 2008; Fernie and Schauer, 2009; Hall et al., 2008), environmental (Bundy et al., 2009; Viant et al., 2006) and mammalian (Griffin, 2008; Kenny et al., 2010; Lewis et al., 2008) systems. Many studies follow a hypothesis generating or inductive strategy (Kell and Oliver, 2004) and start from a small and known subset of biological knowledge. Valid experiments are designed to acquire robust and reproducible data on a wide range of different metabolites and metabolite classes from carefully selected samples. Subsequent data analysis procedures define the metabolic differences associated with a biological change related to genotype, biological perturbation or environmental intervention (for example, drug therapy). These studies employ a metabolic profiling strategy to detect a wide range of (but not all) metabolites from numerous biochemical classes to obtain maximum metabolic information rapidly. This strategy provides the detection of hundreds or thousands of metabolites. However, due to the diverse range of chemical and physical properties and the wide concentration range of metabolites within the metabolome, no single analytical technology can provide the non-biased quantitative detection of all metabolites in a biological system (Dunn, 2008). Metabolic profiling provides semi-quantitative data, typically as chromatographic peak areas, rather than absolute quantitation where metabolite concentrations would be reported. Mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR) have been widely employed and provide complementary roles (Dunn et al., 2005, 2011). Chromatography-MS techniques provide advantages for these highly complex biological samples and include gas chromatography (GC-MS), liquid chromatography (LC-MS) and LC derivatives including ultra performance liquid chromatography (UPLC-MS). Capillary Electrophoresis-MS (Soga et al., 2003) and LC-MS apply electrospray ionization. Direct infusion mass spectrometry (DIMS) can also be applied, though lacks the chromatographic separation of metabolites (Dunn et al., 2005; Southam et al., 2007). Hyphenated platforms can provide (with suitable operation and mass calibration) high separation resolution, high mass resolution and mass accuracy, typical limits of detection of micromol per litre and the ability to identify metabolites through a combination of Retention Time (RT)/index, accurate mass and gas-phase fragmentation-derived mass spectra. Each of these platforms, whether hyphenated or non-hyphenated, provide different advantages and disadvantages for metabolite identification as has been reviewed previously (Dunn et al., 2011). The increasing use of high mass resolution LC-MS and UPLC-MS platforms provides the detection of many thousands of features [see Brown et al. (2009) for a comparison of the features detected related to sample type] with high mass accuracy and has led to a need to develop data handling methods for the conversion of this raw analytical data into biological knowledge.
One of the data processing procedures essential in metabolic profiling is metabolite identification which has been reviewed previously (Dunn et al., 2011; Wishart, 2009). Guidelines have been provided by the Metabolomics Standards Initiative to define how the different levels of metabolite identification can be reported (Sumner et al., 2007). A range of approaches can be applied for metabolite identification. Two generalized types of identification are achievable: putative identification and definitive identification.
Putative identification usually employs one or more molecular properties for identification, but does not compare these to the same properties of an authentic chemical standard as is performed for definitive identification. The accurate mass (or m/z) of an analyte and its associated isotopologues can be used to define molecular formulae (MFs) from which suitable metabolites can be derived by searching a range of electronic resources [e.g. PubChem (http://pubchem.ncbi.nlm.nih.gov/), HMDB (http://www.hmdb.ca/), KEGG (http://www.genome.jp/kegg/) and MMD (Brown et al., 2009)] and has been previously shown (Brown et al., 2005; Junot et al., 2010; Lane et al., 2008; Rogers et al., 2009). Direct matching of accurate mass (or m/z) to data in electronic resources without intermediate matching to MF can also be performed. However, structural isomers and stereoisomers have the same accurate mass and therefore require a separate, orthogonal property for identification of all potential isomers. Typically, this is chromatographic separation though separation of isomers is not always achievable. Separation of enantiomers requires a chiral chromatography column.
Definitive identification employs at least two properties (typically RT or index and fragmentation mass spectrum) and compares these properties to an authentic chemical standard analysed under identical analytical conditions. In LC-MS and UPLC-MS applications, the accurate masses of the detected ions is employed in combination with other rules (e.g. isotope ratio of 12C and 13C isotopic peaks to define the number of carbon atoms present in the MF; calculated as peak area 13C isotopologue/peak area 12C isotopologue) to generate MF and thus provide putative metabolite identification(s). Specific rules are not always applicable. For example, 13C/12C isotopic peaks can only be applied on instruments where accurate isotopic ratios are detected and where 13C-artificially labelled metabolites have not been applied in the biological experiment. Fragmentation mass spectra (MS/MS or MSn) are then used to provide increased confidence through comparison to authentic chemical standards or to in silico-derived fragmentation mass spectra to give an unequivocal metabolite identification (Wolf et al., 2010). It should be noted that not all authentic chemical standards are commercially available and that MS/MS and MSn fragmentation mass spectra are not always accurate in unequivocal identification of two isomeric metabolites.
Studies by the authors have assessed the level of complexity of electrospray UPLC-MS data derived from biological extracts in a metabolic profiling strategy (Brown et al., 2009). This work has shown that a multitude of different ion types are observed including commonly described ions (e.g. protonated, deprotonated, sodium or potassium adducts and 13C isotope). However, many other unexpected types of ions including adducts (e.g. complex combinations of sodium chloride and formate dependant on the matrix type and mobile phase), fragments, dimers, multiply charged and instrument specific ions (e.g. Fourier Transform (FT) artifact peaks) are also detected. Each different ion type is defined as a feature, whose accurate mass (or m/z) is unique but whose RT and chromatographic peak profiles are identical. Information on the type of ion should be applied in metabolite identification (Brown et al., 2009; Draper et al., 2009).
Automated software or workflows for high-throughput identification of large metabolomic datasets are not freely available. Currently, metabolite identification is a manual or semi-automated process assessing those features of biological interest and not the complete set of detected features (Dunn, 2008). For metabolomics to be successful it is essential to derive biological knowledge from analytical data, a view emphasized by a recent Metabolomics ASMS Workshop Survey 2009 which found that the biggest bottlenecks in metabolomics were thought to be identification of metabolites and assigning of biological interest (http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics-Survey-2009).
To fill the gap in requirements, three workflows have been written to perform for the first time integrated, automated and high-throughput annotation and putative metabolite identification of electrospray LC-MS and UPLC-MS metabolomic datasets in a freely available package. The workflows were developed in the Taverna Workflow Management System to provide flexibility in their operation and the ability to rapidly and simply integrate with web services and other Taverna workflows in the future [for example, see Li et al. (2008)], so as to provide integrated data analysis and bioinformatics or cheminformatics packages. Examples are available on myExperiment, a repository of workflows freely available to the scientific community (http://www.myexperiment.org/), including a workflow to perform data pre-processing with XCMS and a workflow to perform in silico fragmentation applying MetFrag. The achievement of this level of integration would be more technically demanding and would require significantly greater expertise and time if coded in many other programming languages. Taverna is also easy to operate for relative novices with minimal training as the process involves defining parameter values and files only.
The current lack of freely available workflows or software to process deconvoluted data acquired from electrospray LC-MS experiments led the authors to develop three workflows. The workflows have been developed in Taverna (Hull et al., 2006) using Beanshell, a Java scripting language, which is enabled in Taverna and can perform data manipulation, parsing and formatting. Taverna can be downloaded from http://www.taverna.org.uk. The workflows were developed under Windows using Taverna v1.7.0 and subsequently tested using Taverna Workbench 2.2.0. In combination, the workflows perform the automated, high-throughput annotation and putative metabolite identification of electrospray LC-MS and UPLC-MS metabolomic datasets. The software has been coded as a series of separate workflows to allow more flexibility in the analysis of data by obviating the need to re-run the whole pipeline when altering one of the workflow parameters, such as the mass tolerance or database. This approach also reduces the likelihood of memory problems when handling large datasets on computers with small RAM.
The workflows, related files and standard operating procedure (SOP) are available to the user community on http://www.mcisb.org/resources/putmedid.html and will also be placed on MyExperiment (http://www.myexperiment.org/). In general, the input and output files are tab-delimited (*.txt) files and are sorted by ascending accurate mass or MF as appropriate (ordered as C, H, N, O, P, S, Br, Cl, F, Si in ascending alphanumeric form as is standard for PubChem). Internal checks are made within the workflows to ensure that the number of features in both peak and data files match (workflow for correlation analysis) and that the study and reference input files are sorted either by accurate m/z (workflow for metabolic feature annotation) or MF (workflow for metabolite annotation). Termination of the process and reporting of an informative error message occurs if this is not the case. The three workflows are described in detail in the available SOP and briefly below.
The workflow for correlation analysis (List_CorrData) allows the user to calculate either Pearson or Spearman rank correlations or read in previously calculated correlation data. The correlation calculations allow for NaN, Inf and 0 in the input data and are equivalent to using the Matlab (http://www.mathworks.co.uk/) corr function with the following parameters:
corr(Xdata, ‘rows’, ‘pairwise’, ‘type’, ‘Pearson’) or corr(Xdata, ‘rows’, ‘pairwise’, ‘type’, ‘Spearman’)
The workflow for metabolic feature annotation (annotate_MassMatch) uses correlation coefficient information calculated in the workflow for correlation analysis, accurate m/z difference, RT and median peak area data to group together and annotate features with the type of ion (isotope, adduct, dimer, others) originating from the same metabolite. The same metabolite can be detected as different ion types each with different m/z. Following annotation, the experimentally determined accurate m/z are matched to the accurate m/z of unique MF in a reference file within a specified m/z tolerance.
In the workflow for metabolite annotation (matchMMF_MF), the MF from the output file calculated in the workflow for metabolic feature annotation is matched to the MF from the Reference file of metabolites (trimMMD_sortMF.txt or other appropriate reference file). The metabolite information for all matched MFs is added to the input data and output data are generated in three formats, each of which can be saved as tab-delimited files by the user.
An assessment has been made of the workflows' ability to perform putative metabolite identification using two reference files: (i) a listing of unique accurate mass/MF data from PubChem using specific elements (C, H, N, O, P, S, Br, Cl, F, Si only; file downloaded from http://fiehnlab.ucdavis.edu/projects/Seven_Golden_Rules/) selected to give a wider selection of MFs and (ii) The Manchester Metabolomics Database (Brown et al., 2009) constructed with data from genome-scale metabolic reconstructions, HMDB, KEGG, LIPIDMAPS, BioCyc and DrugBank. Data from all these sources are included to provide a comprehensive set of metabolites. For example, HMDB does not contain all lipids that are theoretically present in human biofluids and tissues and therefore inclusion of data from LIPIDMAPS provides greater complementary metabolite coverage.
A clinical dataset of fasting blood serum samples were taken from participants according to ethical guidelines and stored before being analysed with quality control samples in a random order and within 48 h of reconstitution using an UPLC (Waters UPLC Acquity, Elstree, UK) coupled on-line to an electrospray LTQ-Orbitrap hybrid mass spectrometer (ThermoFisher Scientific, Bremen, Germany). The collection and storage of serum samples and the UPLC and mass spectrometer methods applied have been previously described (Dunn et al., 2008; Zelena et al., 2009). Independent samples (118 in total) were analysed in both positive and negative ion mode. Raw data files (.RAW) were converted to the NetCDF format using the File converter program in XCalibur (ThermoFisher Scientific, Bremen, Germany). Deconvolution of data was performed using XCMS, running on R version 2.6.0, an open-source deconvolution program available for LC-MS data (Smith et al., 2006) using identical settings to those reported previously (Dunn et al., 2008). This produced a list of features with associated RT, accurate m/z and chromatographic peak area. For these data, the mass accuracy was assessed using a set of 35 and 50 metabolites commonly detected in serum and plasma in positive and negative ion modes, respectively. Shown in Table 1 is the distribution of annotated features found in the dataset and excluded from further mass matching.
In positive ion mode >50% of features were marked for exclusion from further metabolite identification, which was considerably higher than in negative ion mode (36.9%) due in part to the much greater occurrence of multiply charged ions (~10% of all features). It should be noted that multiply charged ions can be peptides, proteins or high molecular weight metabolites capable of carrying multiple charges. Approximately 25% of these excluded features are isotopic peaks. These are annotated and linked to the related molecular ion. In the workflow for metabolite annotation, all isotope and FT artifact peaks are labelled with the accurate identification observed for the molecular or adduct ions.
Using a mass tolerance of 3 p.p.m., ~6% of the remaining features were not matched to any unique MFs in the PubChem reference file (301507 entries). Seventy-six percent in negative and 60% in positive ion mode of the remaining features (791 and 1292 features) were fully annotated and given putative metabolite identification using a revised version of the MMD database (31648 entries). The reference data file is based on molecules and parent compounds carrying no charge and is derived from the MMD which contains an array of information from a wide variety of electronic resources. The MMD data file was revised by removal (or modification) of charged species of salts e.g. sodium ascorbate, calcium citrate, metamphetamine hydrochloride. Obvious duplicates of data were removed and for many common metabolites e.g. amino acids and sugars only a single stereochemical form of the compound was retained. The included form usually related to the one most well described in HMDB, and if not present in HMDB then as described in KEGG, and if not present in HMDB and KEGG then as described in LIPIDMAPS and resulted in a much cleaner dataset for putative metabolite identification. A fairly stringent mass tolerance of 3 p.p.m. was used in this analysis and in a number of cases the annotation of adducts is based on strong evidence, but the exact mass matching may be outside the allowed tolerance. This is certainly seen for metabolites such as tryptophan where many ions/adducts are detected, some of which are within the 3 p.p.m. mass tolerance and others, particularly K and NaCl/HCOONa adducts of low response, are in the mass error range of 3–10 p.p.m. Increasing the mass tolerance would result in more of these adducts being correctly matched but would greatly increase the overall number of putative metabolite identifications through matching to a greater number of MFs. This is possible if further data are acquired to reduce the number of potential hits (e.g. 13C/12C isotope ratios or MS/MS fragmentation). Additionally, neither reference file had ‘complete’ information relevant to the human metabolome—the MMD is from a wide variety of sources and includes drugs (but not drug metabolites), and the PubChem reference file contains a limited number of elements and limited numbers of atoms per element.
The workflow for correlation analysis processed correlation data in 3–20 min depending on the option selected and the number of features. The workflow for metabolic feature annotation processed the negative ion data in ~3 min for the PubChem reference file. In positive ion mode with 100% more peaks and nearly 5 times as many correlations, the processing time was <20 min. The number of correlations is the rate determining factor and in most cases processing is <30 min and frequently <5 min. The workflow for metabolite annotation matched MF derived from experimental data to MF in MMD rapidly when fewer than 5000 matches were present. This process took just over 1 min to perform in each ion mode. However, when using large reference files such as PubChem, typically up to 30 000 matches, processing time may be of the order of 1 h.
The workflows developed are automated, rapid, open-source and freely available with all features fully annotated or given putative metabolite identifications. The approach is flexible, it is independent of chromatographic deconvolution method and analytical instrument applied. Additional adducts can be added to the adducts file for user-specific instruments, and data and organism-specific metabolite reference files can be used. Two additional differently formatted outputs are available for all matched features and the workflows have the potential to be integrated with other Taverna workflows.
Large numbers of features are recognized as artifact, isotope, salt and multiply charged ions and removed from the metabolite identification process. This in combination with data with good mass accuracy (in raw data or following post-acquisition mass alignment) results in a greatly reduced number of putative metabolite identifications within a specified mass tolerance (typically 3 p.p.m. for the ThermoFisher Scientific LTQ-Orbitrap acquired data). Using this approach, putative metabolite identification does not depend on a high level of experience in dealing with MS data and can be used as a starting point for subsequent definitive identification.
A number of false positives are always found although they are significantly reduced using this approach by annotation of ion type prior to assignment of accurate m/z to MF or metabolite. Wide variation in m/z or RT range or missing values following deconvolution can result in any or all of the following: (i) mass difference can be outside mass tolerance limits (e.g. missed adduct); (ii) RT difference between two features may be outside given value (missed grouping); and (iii) correlation between two features may be below specified limit (missed adduct, missed grouping). Within the software, allowance is made for this, for example, a feature (sodium adduct) that has a correlation with the parent metabolite below the correlation limit will not be grouped with this feature. However, if the m/z is within the mass tolerance it will still be putatively identified as the appropriate sodiated ion. Additional grouping information based on correlation is present in the output file and can assist the user when two or more MFs matches are reported for a feature.
The workflows presented are rapid and high-throughput and greatly reduce the number of false positives by eliminating the inaccurate matching of many artifact, isotope and complex adduct peaks. Subsequent definitive identification employing at least two properties of sample-derived metabolite and an authentic chemical standard (typically RT and fragmentation mass spectrum) can then be performed. Additional information based on similarity measures (e.g. metabolite class or metabolite pathway) are being incorporated into the Manchester Metabolomics Database and will allow in time for further interrogation of the biological changes of interest within microbial, plant and mammalian metabolomic studies. Further developments are planned to amalgamate the separate workflows together and to integrate with separate workflows to increase the applicability and ease of data analysis and interpretation.
Funding: UK Biotechnology and Biological Sciences Research Council (BBSRC) (BBC0082191); The Wellcome Trust (088075/A/08/Z); Johnson and Johnson, Cancer Research UK; The Manchester National Institute for Health Research (NIHR) Biomedical Research Centre.
Conflict of interest: none declared.