Gas chromatography coupled to mass spectrometry (GC-MS) is one of the most widespread routine technologies applied to the large scale screening and discovery of novel metabolic biomarkers. However, currently the majority of mass spectral tags (MSTs) remains unidentified due to the lack of authenticated pure reference substances required for compound identification by GC-MS. Here, we accessed the information on reference compounds stored in the Golm Metabolome Database (GMD) to apply supervised machine learning approaches to the classification and identification of unidentified MSTs without relying on library searches. Non-annotated MSTs with mass spectral and retention index (RI) information together with data of already identified metabolites and reference substances have been archived in the GMD. Structural feature extraction was applied to sub-divide the metabolite space contained in the GMD and to define the prediction target classes. Decision tree (DT)-based prediction of the most frequent substructures based on mass spectral features and RI information is demonstrated to result in highly sensitive and specific detections of sub-structures contained in the compounds. The underlying set of DTs can be inspected by the user and is made available for batch processing via SOAP (Simple Object Access Protocol)-based web services. The GMD mass spectral library with the integrated DTs is freely accessible for non-commercial use at http://gmd.mpimp-golm.mpg.de/. All matching and structure search functionalities are available as SOAP-based web services. An XML + HTTP interface, which follows Representational State Transfer (REST) principles, facilitates read-only access to database entities.
The identification of the high number of as yet unidentified metabolic components from GC-MS profiling experiments poses a major challenge in metabolite profiling. Two factors contribute to the high complexity of typical GC-(TOF)-MS experiments. Firstly, as GC-MS inherently requires volatile analytes, metabolites of interest need to be chemically modified, for example by methoxyamination and silylation reagents (Kopka 2006, Lisec et al. 2006). Thus, more than one single analyte per metabolite may be generated and thorough chemical interpretations of observed analytes with respect to their mass spectral and retention index (RI) properties are required. Secondly, a compound library comparison as the most straightforward approach for identifying components from GC-MS analyses relies on the availability of authenticated pure reference substances. Currently, metabolite identification is only possible by a time consuming, manually supervised matching of both the RI information and the reference mass spectra stored in dedicated libraries such as the Golm Metabolome Database (GMD) (Kopka et al. 2005, Schauer et al. 2005). Therefore, the lack of chemically synthesized reference substances and of purified bio-molecules may be seen as the current bottleneck of comprehensive compound identification, as identification is only possible if the detected compound is present in the library of reference compounds.
In order to provisionally accommodate unidentified compounds, the GMD archives respective Mass Spectral Tags (MSTs). MSTs are defined to represent the combination of chemo-physical properties, namely the mass fragmentation pattern linked to the chromatographic RI information (Wagner et al. 2003). In addition, the GMD also comprises a large compendium of identified compound entries. These entries represent known metabolite structures and are linked to the source (vendor etc.) information of the respective reference substances. Thus, the GMD may represent an ideal resource for the application of supervised machine learning algorithms for compound classification as a means for an automated annotation of MSTs. The GMD compendium may thus be used to enhance the chemical identification process of novel metabolic components discovered by GC-(TOF)-MS based metabolomic screening studies.
Currently, most novel GC-MS based analytical signals remain unidentified as there is no reference substance available. Of the total of up to 1,000 MSTs observed in typical studies, only 50–150 metabolites can be identified. The determination of the chemical sum formula associated with molecule peak and electron-impact induced fragment peaks may be feasible. The unambiguous mass spectral interpretation, however, is in many cases only possible for small molecules bearing a single functional group (Varmuza and Werther 1996). For this reason, hit list based mass spectra similarity matching has evolved as a highly successful tool for the routine assessment of mass spectra (Halket et al. 2005), and large commercial mass spectral libraries, such as NIST08 (http://chemdata.nist.gov/mass-spc/Srch_v1.7/index.html) have been developed. The employed matching approaches use various similarity scores, which were developed in conjunction with specifics of the gas chromatographic and mass spectrometric technology (Crawford and Morrison 1968, Stein 1999, Stein and Scott 1994). However, a reliable automation of mass spectral matching has not been accomplished so far. Tools utilising RI information for the matching process adapted for the processing of large metabolite profiling experiments, such as TagFinder (Luedemann et al. 2008), recommend manually supervised compound identification.
Beyond similarity-based matching of mass fragmentation patterns, algorithmic approaches that relate mass spectral features to structural properties have been pursued. For example, several low resolution mass spectral classifiers have been reported to yield promising results. Werther et al. (1994) successfully applied diverse computational classification techniques to the recognition of simple structural moieties. These authors tested the prediction of 10 structural characteristics and found neural networks to be superior compared to k-nearest neighbour (KNN) classification, linear discriminant analysis, or principal component models. Furthermore, Varmuza and Werther (1996) presented an enhanced approach based on random sampling of training mass spectra according to predefined spectral features and the subsequent application of multivariate classification methods or neural networks. Approximately 160 spectral classifiers were developed that cover a significant portion of organic chemistry, but they may only be partially applicable to bio-molecules, such as the primary and secondary metabolites. These spectral classifiers are now part of the NIST software. Since version 2.0, the software package Mass Frontier™ supports the classification of mass spectra (http://www.highchem.com/new-features/), utilizing three classification methods: Principal Component Analysis (PCA), Fuzzy Clustering, and Self-Organizing Maps (SOM) (http://www.highchem.com/mass-spectra-classification/) (Steiner et al. 2002). Further enhanced approaches attempt to include isomer prediction from given sum formulas (Varmuza 2001).
Progress has also been made for the prediction of the presence or absence of substructures based on mass spectral features. Mass spectral classifiers were implemented using linear discriminant analysis, LDA, and partial least squares discriminant analysis, PLS-DA (Yoshida et al. 2001), or by selection of characteristic mass fragment combinations (Tang et al. 2003). In view of these successful approaches, we developed a decision tree (DT) based classifier and a web-based interface dedicated to the specific metabolomic needs of GC-MS-based profiling. In contrast to prior general organic chemistry-based efforts, we focus on metabolites and substructures of metabolic origin. Furthermore, we combine the chromatographic RI information (Strehmel et al. 2008) with mass spectral features for classification and substructure prediction. Most importantly, we chose the DT algorithm to solve the classification problem. This algorithm is employed for the recognition of patterns in mass fragmentation spectra that distinguish classes of compounds which either contain or lack a specific predefined chemical moiety. The DTs are made available via the augmented web-interface of the GMD as well as web-services to assist in the annotation of metabolomics data sets.
The GMD uses a Microsoft SQL Server 2008™ as the relational database backend for relating the mass spectrum and retention behaviour to an analyte, i.e. the chemically modified compound, which is mapped to represent a metabolite (Fig. 1) (Hummel et al. 2008). Both analyte and metabolite have the properties of a chemical compound and are linked to structures archived as .mol-files and InChI™ codes (http://www.iupac.org/inchi/). A typical metabolite has one to two analytes, which are generated by the chemical derivatization process inherent to the GC-MS profiling technique. Each analyte has multiple technological versions of MSTs. These replicate mass spectra and RIs are empirically determined using different mass spectral technologies, e.g. time of flight, quadrupole or ion trap based mass detectors, and variations of gas chromatographic systems (Strehmel et al. 2008).
In the current GMD release, 6,187 mass spectra are available representing 2,444 analytes and 1,535 metabolites. It should be noted that the GMD compendium is biased towards GC-MS accessible, stable, primary metabolites. Therefore, the structural moieties of the metabolite classes, amino acids, organic acids, fatty acids, fatty alcohols, sugars, sugar alcohols and respective conjugates dominate. Structural annotations are in most cases stereo-chemically correct, even though routine GC-MS profiling (Lisec et al. 2006, Wagner et al. 2003) allows only the differentiation of anomeric, epimeric structures and E/Z-geometric isomers.
A supervised machine learning approach using a pattern recognition algorithm was chosen to infer correlations between the sub-structure properties of known compounds and the properties of respective MSTs. For every considered functional group, we classify each MST as belonging either to the group of compounds that contain the functional group or to the group that does not. Thus, we perform a binary classification. The DT method was applied for two reasons. Firstly, multiple parameter types, categorical and numerical, can be integrated, and no assumptions about numerical parameter distributions or about the nature of discriminating functions, e.g. linear, non-linear, multimodal, are required. Secondly, in contrast to the NIST Mass Spectrum Interpreter software for substructure analysis, rules comprised of single feature decisions are returned, which describe the criteria of the mass spectral classification process and are suitable for interpretation by a GC-MS expert.
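To illustrate what "rules comprised of single feature decisions" can look like in practice, the following minimal Python sketch walks a small binary tree and returns both the present/absent call and the list of rules that led to it. The tree, its feature names (`lg_intensity_mz_103`, `RI_VAR5`) and its thresholds are hypothetical and chosen only for illustration; they are not taken from the GMD decision trees.

```python
from typing import Dict, List, Tuple, Union

# A node is either a leaf (bool: substructure predicted present?) or a
# dict describing a test on one single feature.
Node = Union[bool, dict]

# HYPOTHETICAL tree: feature names and thresholds are invented for illustration.
EXAMPLE_TREE: Node = {
    "feature": "lg_intensity_mz_103",
    "threshold": 1.5,
    "below": False,
    "at_or_above": {
        "feature": "RI_VAR5",
        "threshold": 1200.0,
        "below": False,
        "at_or_above": True,
    },
}


def classify(tree: Node, features: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Walk the tree; return (prediction, human-readable rule path)."""
    rules: List[str] = []
    node = tree
    while isinstance(node, dict):
        name, threshold = node["feature"], node["threshold"]
        if features[name] >= threshold:
            rules.append(f"{name} >= {threshold}")
            node = node["at_or_above"]
        else:
            rules.append(f"{name} < {threshold}")
            node = node["below"]
    return node, rules


if __name__ == "__main__":
    call, path = classify(
        EXAMPLE_TREE, {"lg_intensity_mz_103": 2.0, "RI_VAR5": 1500.0}
    )
    print(call, path)
```

Because every inner node tests exactly one feature against one threshold, the returned rule path can be read directly by a GC-MS expert, which is the interpretability argument made above.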
The checkmol program (Feldman et al. 2005) was executed to automatically extract the 21 most abundant structural features, e.g. substructures or chemical moieties, from the metabolite structures of the GMD (>3% occurrence). Subsequently, we created DTs using mass spectral and retention properties as predictors of the structural features.
The mass spectra used to train the DT algorithm were electron impact spectra of methoxyaminated and trimethylsilylated reference compounds with known structures, natural isotopomer composition, and documented reference compound sources.
It should be noted that, in the current release, we use the MSTs of chemically derivatized metabolites for the analysis of structural features present in non-derivatized metabolite structures. As the DT algorithm supports this approach, we reasoned that the interest of the biologist and GMD user lies in the metabolite structure rather than in the methoxyaminated and trimethylsilylated compounds inherently required for GC-MS based metabolite profiling.
DT training was performed separately for each considered structural feature. For this procedure, the mass spectral compendium of the GMD was divided into those mass spectra associated with metabolites containing the respective structural feature and those in which the structural feature was absent. DT training was performed with and without using the RI information linked to each MST. In order to use the RI information, a subset of training data with empirically determined RIs was created. The supported RI models are based on standardization by 9 n-alkanes (C10–C36) and either a 5%-phenyl-95%-dimethylpolysiloxane capillary column, in short VAR5, or a 35%-phenyl-65%-dimethylpolysiloxane column (MDN35, Lisec et al. 2006). RI information of 8 variant VAR5 chromatography methods was converted according to Strehmel et al. (2008).
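As a reminder of how an RI is obtained from n-alkane standards, the sketch below uses the widely used linear (temperature-programmed) retention index, which interpolates between the two bracketing alkanes. This is an assumption for illustration: the text does not state the exact RI formula used by the GMD, and the carbon numbers and retention times in the usage example are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def retention_index(
    t: float, alkane_carbons: Sequence[int], alkane_times: Sequence[float]
) -> float:
    """Linear retention index by interpolation between bracketing n-alkanes.

    alkane_carbons and alkane_times must be sorted by retention time.
    Non-consecutive alkanes are allowed (carbon-number gap is interpolated).
    """
    if len(alkane_carbons) != len(alkane_times) or len(alkane_times) < 2:
        raise ValueError("need at least two alkanes with matching times")
    if not alkane_times[0] <= t <= alkane_times[-1]:
        raise ValueError("retention time outside the calibrated alkane range")
    # Index of the last alkane eluting at or before t (clamped to a valid pair).
    i = min(bisect_right(alkane_times, t) - 1, len(alkane_times) - 2)
    t_lo, t_hi = alkane_times[i], alkane_times[i + 1]
    c_lo, c_hi = alkane_carbons[i], alkane_carbons[i + 1]
    return 100.0 * (c_lo + (c_hi - c_lo) * (t - t_lo) / (t_hi - t_lo))


if __name__ == "__main__":
    # HYPOTHETICAL calibration: three alkanes and their retention times (min).
    carbons = [10, 12, 15]
    times = [5.0, 7.0, 10.0]
    print(retention_index(6.0, carbons, times))
```

An analyte eluting exactly with an n-alkane of carbon number n receives an RI of 100·n, so the scale is anchored at the standards regardless of the absolute retention times of a given chromatographic system.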
A mass spectrum can be considered a point in an n-dimensional mass space with n representing the number of individual masses/charge fragment ratios as coordinate axes with associated values corresponding to fragment abundances. In order to characterise mass spectra in terms of those properties, which are potentially relevant for the structural distinction of chemical substance classes, additional spectral features have been proposed, e.g. weighted abundance of single masses, intensities of single masses normalised to the local ion current, averaged intensities of mass intervals, logarithmic transformations, modulo-14 summation, autocorrelation properties, so-called spectrum type features, and characteristic peak series features (Varmuza 2001, Xu et al. 2003).
As DT methods allow the combined use of diverse properties, we extracted three types of spectral features in addition to the above mentioned RI information. (1) Logarithmic intensities of nominal masses, as proposed previously (cf. above), were used in the mass range m/z 70–600. However, only 525 spectral features (“intensity-lg”) were allowed after exclusion of ubiquitous mass fragments typically generated from compounds carrying a trimethylsilyl-moiety, namely the fragments at m/z 73, 74, 75, 147, 148, and 149. (2) For an improved feature construction with better discriminative potential (Kotsiantis et al. 2006), we implemented the full set of logarithmic pair-wise intensity ratios, thus, adding 137,550 spectral features to the DT training input space (“ratio lg”). (3) In addition to these features, we used a mass distance measure to represent the mass losses of typical electron impact induced fragmentation reactions. Mass distances caused by the naturally occurring 13C-isotopic patterns were excluded. This spectral processing provided 524 additional spectral features (“peak distance”) per MST. In detail, only those mass fragments associated with a local intensity maximum were considered for peak distance calculation, whereas flanking mass fragments with descending intensities at (m/z) − 1(2, 3, …) or (m/z) + 1(2, 3, …) were removed before calculating the peak distance matrix.
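The feature counts quoted above can be checked with a short Python sketch. It builds the list of nominal masses in m/z 70–600 without the six trimethylsilyl-related masses, the logarithmic intensities, and all pair-wise logarithmic ratios. The pseudo-count added before taking the logarithm and the simplified local-maximum rule used for the peak distances are assumptions made for this illustration, not details given in the text.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

MZ_MIN, MZ_MAX = 70, 600
TMS_MASSES = {73, 74, 75, 147, 148, 149}  # excluded ubiquitous TMS fragments
MASSES: List[int] = [m for m in range(MZ_MIN, MZ_MAX + 1) if m not in TMS_MASSES]


def lg(intensity: float) -> float:
    # ASSUMPTION: a pseudo-count of 1 avoids log10(0) for absent fragments.
    return math.log10(intensity + 1.0)


def intensity_lg(spectrum: Dict[int, float]) -> Dict[int, float]:
    """Feature type (1): logarithmic intensity per nominal mass."""
    return {m: lg(spectrum.get(m, 0.0)) for m in MASSES}


def ratio_lg(spectrum: Dict[int, float]) -> Dict[Tuple[int, int], float]:
    """Feature type (2): logarithmic intensity ratio for every mass pair."""
    feats = intensity_lg(spectrum)
    return {(a, b): feats[a] - feats[b] for a, b in combinations(MASSES, 2)}


def local_maxima(spectrum: Dict[int, float]) -> List[int]:
    """Masses whose intensity exceeds both direct neighbours (simplified)."""
    return [
        m for m in MASSES
        if spectrum.get(m, 0.0) > 0.0
        and spectrum.get(m, 0.0) > spectrum.get(m - 1, 0.0)
        and spectrum.get(m, 0.0) > spectrum.get(m + 1, 0.0)
    ]


def peak_distances(spectrum: Dict[int, float]) -> Set[int]:
    """Feature type (3), simplified: mass differences between local maxima."""
    return {b - a for a, b in combinations(local_maxima(spectrum), 2)}


if __name__ == "__main__":
    toy = {100: 50.0, 101: 5.0, 115: 80.0, 116: 8.0, 130: 20.0}  # toy spectrum
    print(len(MASSES), len(ratio_lg(toy)), sorted(peak_distances(toy)))
```

The sketch yields 525 masses and 525 × 524 / 2 = 137,550 mass pairs, matching the numbers stated for the "intensity-lg" and "ratio lg" feature sets. In the toy spectrum, the flanking isotope peaks at m/z 101 and 116 are not local maxima and therefore do not contribute to the peak distances.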
In summary, MSTs were pre-processed to obtain those spectral features best representing the probability that a specific fragment is generated from a given compound (intensity-lg, ratio lg) and the mass differences between fragments indicative of the typical cleavage reactions of chemical moieties. Both types of information were used to train DTs with or without the use of RI information.
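Using the Microsoft SQL 2008 Server Analysis Services™, DTs were trained for selected single structural features. Because an SQL Server table is limited to 1,024 columns, the predictor variables had to be pre-selected. We used the Fisher ratio, Fr, for ranking the variables, with

$$F_r = \frac{(m_1 - m_2)^2}{v_1 + v_2} \qquad (1)$$

where m1 and m2 denote the means and v1 and v2 the variances of a spectral feature within the two classes, i.e. the MSTs with and without the respective structural feature.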
In total, 138,599 Fisher scores were computed for each functional group for the evaluation of the respective discriminative power of all available mass spectral features. The 1,000 highest scoring spectral features were chosen for each prediction task. When multiple feature types were used for DT training, features of each type were selected in equal proportions.
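The pre-selection step can be sketched in Python as follows, assuming the common two-class form of the Fisher ratio (squared difference of the class means divided by the sum of the class variances). The function names, the use of population variances, and the handling of zero variance are assumptions for this illustration.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def fisher_ratio(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Two-class Fisher ratio of one feature: (m1 - m2)^2 / (v1 + v2)."""
    diff_sq = (mean(pos) - mean(neg)) ** 2
    denom = pvariance(pos) + pvariance(neg)
    if denom == 0.0:
        # ASSUMPTION: constant features that still separate the classes
        # rank first; constant, non-separating features rank last.
        return math.inf if diff_sq > 0.0 else 0.0
    return diff_sq / denom


def top_features(
    columns: Dict[str, Sequence[float]], labels: Sequence[bool], k: int
) -> List[str]:
    """Rank features by Fisher ratio and keep the k highest scoring ones."""
    scores = {}
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y]
        neg = [v for v, y in zip(values, labels) if not y]
        scores[name] = fisher_ratio(pos, neg)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: feature "a" separates the classes better than feature "b".
    cols = {"a": [1.0, 3.0, 5.0, 7.0], "b": [1.0, 5.0, 3.0, 7.0]}
    labels = [True, True, False, False]
    print(top_features(cols, labels, k=1))
```

In the setting described above, `k` would be 1,000 and the ranking would be repeated for each structural feature, since the discriminative power of a spectral feature depends on the substructure being predicted.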
The final training set submitted to the DT algorithm comprised 1,004 columns, 1,000 of which held the pre-selected best scoring mass spectral features. Two columns containing the optional RI-related information from the VAR5- and MDN35-RI systems were added, while one column contained the present or absent call of the structural feature under investigation. The fourth additional column comprised the primary key reference to the respective mass spectrum entry within the GMD. Three DT training procedures were performed, generating a DT without RI information, and two DTs with RI information of either the VAR5 or the MDN35 chromatographic systems. For DTs with RI information, only those MSTs with available RI information were considered.
The DT models including selected features, transformations and other pre-processing details were saved to the server for subsequent application to user submitted MSTs of unknown structure. In the current build, the DT algorithm of the Microsoft SQL Server Analysis Services™ was parameterized according to default recommendations (cf. Table 1). The minimum node support was set to 10 spectra (tree expansion break off criterion).
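The effect of a minimum node support can be illustrated with a single-feature split search in Python. This sketch uses Gini impurity as a stand-in split score, which is an assumption: the text does not state which score the Analysis Services algorithm applied. Only the break-off idea is shown, namely that a split is rejected when either child node would hold fewer than `min_support` spectra.

```python
from typing import Optional, Sequence, Tuple


def gini(pos: int, neg: int) -> float:
    """Gini impurity of a node holding pos positive and neg negative spectra."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_split(
    values: Sequence[float], labels: Sequence[bool], min_support: int = 10
) -> Optional[Tuple[float, float]]:
    """Return (threshold, weighted impurity) of the best admissible split.

    Returns None when no split leaves at least min_support spectra in
    both child nodes, i.e. tree expansion stops at this node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, y in pairs if y)
    best: Optional[Tuple[float, float]] = None
    left_pos = 0
    for i in range(1, n):
        left_pos += 1 if pairs[i - 1][1] else 0
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        if i < min_support or n - i < min_support:
            continue  # a child node would fall below the minimum support
        left_neg = i - left_pos
        right_pos = total_pos - left_pos
        right_neg = (n - i) - right_pos
        score = (i * gini(left_pos, left_neg)
                 + (n - i) * gini(right_pos, right_neg)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    vals = [float(v) for v in range(20)]
    labs = [False] * 10 + [True] * 10
    print(best_split(vals, labs, min_support=10))
    print(best_split(vals, labs, min_support=11))
```

With 20 toy spectra, a minimum support of 10 admits exactly one split position, while a minimum support of 11 admits none and the node becomes a leaf.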
The prediction performance was assessed by the precision (p) and recall (r) measures, with

$$p = \frac{TP}{TP + FP} \qquad (2)$$

$$r = \frac{TP}{TP + FN} \qquad (3)$$

TP, FP, TN, and FN define true positive, false positive, true negative and false negative predictions, respectively.
For a combined characterization of precision and recall, the Fpr-value with

$$F_{pr} = \frac{2\,p\,r}{p + r} \qquad (4)$$

was computed as a frequently used performance measure in the field of information retrieval (van Rijsbergen 1979). Fpr-values of 1 indicate optimal, while values approaching zero correspond to minimal prediction performance.
Matthews correlation coefficient (MCC) is commonly used for the assessment of binary classifications and was shown to be robust with regard to imbalanced class distributions (Matthews 1975). MCC can be computed from the contingency table according to Eq. 5:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (5)$$

MCC values range from −1 (perfect inverse prediction) to +1 (perfect prediction). A coefficient of 0 represents an average random prediction. The error rate obtained in cross validation (CV), ErCV, is computed as

$$Er_{CV} = \frac{FP + FN}{TP + FP + TN + FN} \qquad (6)$$

and should approach zero with increasing DT quality.
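The performance measures can be computed from a contingency table with a few lines of Python. The sketch assumes the standard definitions of precision, recall, their harmonic mean (Fpr), and MCC, and takes the CV error rate to be the fraction of misclassified spectra; returning 0.0 for undefined ratios is a convention chosen here for illustration.

```python
import math
from typing import Dict


def cv_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, Fpr, MCC and error rate from TP, FP, TN, FN."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("empty contingency table")
    # ASSUMPTION: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_pr = (2.0 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    error = (fp + fn) / total
    return {
        "precision": precision,
        "recall": recall,
        "f_pr": f_pr,
        "mcc": mcc,
        "error": error,
    }


if __name__ == "__main__":
    # Toy contingency table, not taken from the GMD cross validation.
    for name, value in cv_metrics(tp=45, fp=5, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

For the toy table, precision, recall and Fpr are all 0.9, MCC is 0.8, and the error rate is 0.1. MCC is lower than Fpr because it also accounts for the true negatives, which is why it remains informative when the class with the substructure is much smaller than the class without it.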
We developed DT-based substructure prediction as a potentially powerful tool box for the structural characterisation of the numerous non-identified MSTs that are encountered in routine GC-MS based metabolite profiles. To enable evidence based substructure prediction we utilized the rich resource of mass spectra and RI information of authenticated reference compounds from the GMD. Application of supervised machine learning approaches required updating of the GMD with structure information of the contained metabolites. This added information now allows binary partitioning of the known metabolites into training data that either contain or do not contain the assessed substructure. The application of supervised machine learning algorithms, such as the DT classification, now supports the in silico characterisation of yet non-identified MSTs that are frequently recognised as relevant marker molecules by non-targeted metabolite profiling. In order to define and compare the performance of the chosen DT-classification approaches for the potential users of the GMD web site and the offered web services (cf. Sect. 3.2) we report in the following the implemented CV procedures of the provided DTs, the respective feature usage and finally assess typical application cases.
For the characterization and comparison of DT performance, we implemented measures based on the subtotal TP, FN, FP, and TN prediction of the CV contingency table. These measures allow the assessment of alternative DTs for identical substructures, e.g. DTs with or without use of RI information. Also the quality of DTs was made comparable between different substructure predictions as two general classification errors exist, (1) deficiencies resulting from imperfect MST training data, and (2) deficiencies due to imbalanced training data. Both potential errors lead to over-fitted DTs and may compromise substructure predictions.
For the reasons stated above, a ten-fold CV by iterative exclusion of randomly chosen MSTs was routinely implemented. As an alternative, we explored a heuristic validation process by excluding all technological replicate MSTs of single analytes (data not shown). As the GMD will provide a steadily increasing number of replicate mass spectra and RIs of each analyte, which is expected to improve DT-based substructure prediction, we decided to use random choice of MSTs for CV. Our implemented procedure characterises DT performance and enables the calculation of overall error estimates for each substructure tree. CV results are displayed along with the respective DT information and visualization (Fig. 2). In order to demonstrate the implementation of the chosen CV process, exemplary results of a ten-fold CV are shown for the DT-classification which assesses the presence or absence of an amine moiety (Table 2).
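A k-fold CV by exclusion of randomly chosen MSTs can be sketched in Python as follows. The callback `train_and_predict` is a hypothetical placeholder for DT training and prediction, and the fixed random seed is an assumption made so that the example is reproducible.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(
    labels: Sequence[bool],
    train_and_predict: Callable[[List[int], List[int]], List[bool]],
    k: int = 10,
    seed: int = 0,
) -> Tuple[int, int, int, int]:
    """Accumulate (TP, FP, TN, FN) over all k held-out folds.

    train_and_predict(train_idx, test_idx) is a HYPOTHETICAL callback that
    trains on train_idx and returns one prediction per entry of test_idx.
    """
    tp = fp = tn = fn = 0
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train_idx, test_idx)
        for i, predicted in zip(test_idx, predictions):
            if predicted and labels[i]:
                tp += 1
            elif predicted:
                fp += 1
            elif labels[i]:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


if __name__ == "__main__":
    toy_labels = [True] * 6 + [False] * 4

    def always_present(train_idx: List[int], test_idx: List[int]) -> List[bool]:
        # Trivial stand-in for a trained DT: always predicts "present".
        return [True] * len(test_idx)

    print(cross_validate(toy_labels, always_present, k=5))
```

Each MST is held out exactly once, so the accumulated counts form one contingency table per substructure tree. Excluding all technological replicates of an analyte together, the alternative mentioned above, would only require replacing `k_fold_indices` by a partition over analytes instead of single MSTs.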
Using a Precision-Recall-Plot (Fig. 3), we evaluated precision and recall in relation to an Fpr-measure threshold = 0.65 using the most recent set of DTs which includes the RI information based on standardization by 9 n-alkanes (C10–C36) and a 5%-phenyl-95%-dimethylpolysiloxane capillary column, in short VAR5 (Strehmel et al. 2008).
Precision and recall measures were high for almost all tested substructures. Only the three DTs targeting the prediction of heterocycle, hemiacetal, and carboxylic acid ester substructures resulted in slightly inferior DT-classifications. These substructures are, hence, excluded from public use and will be targeted by future efforts to enhance the applied DT algorithm and to extend the GMD compendium by suitable training data. Since the hard to classify substructures were also those with the smallest number of available training data, we expect that extended data resources within GMD will immediately result in improved performance of substructure prediction.
Table 3 summarizes the MST feature usage within the current set of DTs compiled from the May 2009 GMD version including the VAR5 RI information as input variables. All DTs were generated independently, i.e. both data processing and feature pre-selection were performed separately for each substructure prior to DT training. As expected, the features incorporated in the DT vary considerably between the different predicted substructures. The usage of characteristic mass spectral fragments (m/z) agrees with the chemical nature and hierarchy of the investigated biochemical moieties. The repeated use of characteristic mass fragments for similar substructures is apparent. For example, the fragment m/z = 99 is consistently used for all amine-like substructures, m/z = 103 in the case of alcohol-like substructures and m/z = 160 for carbonyl-like substructures. These mass fragments may be termed canonical masses which result from fragmentation reactions that are typical of compounds belonging to certain chemical classes. For example, the mass fragment m/z = 160 represents methoxyaminated aldehyde moieties, which are characteristic for reducing aldose-sugars. The fragment m/z = 103 is an abundant and typical cleavage product of trimethylsilylated primary alcohols, such as non-cyclic sugar and polyol molecules. While m/z = 103 represents the cation [.CH2O(TMS)+] and m/z = 160 the cation [C=NOCH3–CH2O(TMS)+], the source and usage of m/z = 99 for the prediction of amines is not yet fully understood.
Considering the available choice of the numerous preselected 1,001 variables (a maximum of 1,000 spectral features plus one optional RI information), the DT classification uses only a comparatively small and specific number (<14) of selected features per DT (cf. summary row*1) in Table 3). This small number and the frequent choice of specific features represent an additional safe-guard against the risk of DT over-fitting. The chemical analysis and interpretation of feature usage and the analysis of the surprising absence of “ratio-lg” criteria from the current DTs is in progress but was deemed beyond the scope of this study.
In order to characterize the potential, but also the caveats of substructure predictions using the DT algorithm provided by GMD, we performed typical application cases. New or non-identified metabolites will—in most cases—be discovered as automatically deconvoluted mass spectra from profiles of complex biological samples. Automated mass spectral deconvolution represents the typical solution to the task of analysing GC-MS profiles of highly complex biological matrices; here we use the term matrix to refer to the sum of all monitored metabolites from a biological sample. Under such conditions the available deconvolution algorithms may remove specific mass fragments either because of low compound abundance, resulting in a low signal-to-noise ratio (S/N), or due to chromatographic peak shape artefacts. Alternatively, mass fragments belonging to chromatographically co-eluting compound(s) may be incorrectly added.
Because of the ease of automated deconvolution, the experimental scientist might be tempted to base an initial structural elucidation attempt on such potentially compromised MSTs. To demonstrate the risk of such an approach, we selected four compounds derived from two complex plant matrices, namely potato tuber and rice leaf. The compounds glucose (1MeOX 5TMS), citric acid (4TMS), valine (2TMS) and putrescine (4TMS) were chosen to represent carbohydrates, organic acids, amino acids, and amines as typical metabolite classes. These metabolites contain most of the frequently occurring metabolite substructures, which have been targeted by our DT classification approach. Figure 4 shows the metabolite structures before chemical derivatization by methoxyamination (carbonyl modifying) and trimethylsilylation (substituting protons bound to heteroatoms).
Our application cases convincingly demonstrate that many expected substructures are recognized with high reproducibility (Table 4). Nevertheless, clear differences with regard to the biological source or concerning single automated deconvolutions from identical sources exist. For example, the primary alcohol substructure is only recognized in part of the deconvolutions representing glucose and fructose, whereas the more generalized substructures containing OH-moieties are diagnosed with high repeatability. In addition, the alpha-amino acid and the more general amine-substructure were recognized with varying reproducibility, in some cases only in a small part of the automated deconvolutions.
In conclusion, automated mass spectral deconvolutions should only be considered with care and for a preliminary substructure assessment. We recommend the use of manually curated mass spectra and a statistical evaluation of multiple high quality mass spectra best obtained from multiple biological sources or from at least two different chromatographic systems. Specialized laboratories may avoid many GC co-elution artefacts by application of a two dimensional (GCxGC)-TOF–MS system.
All DTs developed as part of this study have been made freely available to academic users for spectra-based compound annotation at http://gmd.mpimp-golm.mpg.de/. For automated batch processing, the developed platform-independent Simple Object Access Protocol (SOAP) based web service endpoint wsPrediction provides public access to the functionality presented here.
The mass spectrum and RI compendium of the GMD has been used as a training data set for a supervised machine learning approach using a DT algorithm for the classification of MSTs and the retrieval of human-interpretable classification rules. The new GMD frontend provides a rich set of substructure classification models comprising mass spectral patterns and optional extensions including RI information, which group MSTs with common substructures. The offered DTs are provided as an extension to the conventional hit list based mass spectral matching approach and can be used to support the interpretation of MSTs from known metabolites and also facilitate the classification of those MSTs, which are not yet identified. The classification tools of the GMD frontend can be updated with the continuously growing set of GMD entries. The success of updating efforts can be assessed using DT cross validation (CV) parameters, such as precision, recall, Fpr-measure, MCC, and the CV error, which have been implemented in the course of this project to compare DT performance and to reject weak prediction models. Hence, this new web interface and application may contribute to the evidence-based classification of non-identified MSTs and follows the general recommendations of the metabolomics standards initiative for reporting standards for chemical analysis (Sumner et al. 2007).
The DTs presented in this work depend on the continued curation and enhancement of the GMD content. Specifically, residual deconvolution errors will be removed, the spectral quality improved and the number of high quality replicate spectra for existing MSTs extended. Most importantly, new metabolites will be added to the GMD compendium. As a consequence, these efforts will necessitate an updating scheme for the DT substructure predictions and evaluation of DT performance will become a frequent use case.
Furthermore, the extension towards DT analyses of those substructures, which are underrepresented in the current GMD dataset, appears to be an attractive goal. Finally, given the availability of multiple DTs for the prediction of one particular substructure, the application of DT forests may be worthwhile.
The authors acknowledge the long standing support and encouragement by Prof. L. Willmitzer, Prof. M. Stitt and Prof. R. Bock, Max Planck Institute of Molecular Plant Physiology (MPI-MP), Am Muehlenberg 1, D-14476 Potsdam-Golm, Germany. The authors thank Dr. D. Steinhauser, Dr. A. R. Fernie, A. Erban, I. Fehrle, J. Hannemann and M. Kuczmierczyk for the generation of metabolite structures and the interactive discussions during project realization.
Funding: This work was supported by the Max Planck Society, the QuantPro program of the Bundesministerium für Bildung und Forschung (BMBF), sub-project “InnOx—Innovative diagnostic tools to optimise potato breeding: Systematic analysis of cellular processes and their relation to plant internal oxygen concentrations”, FKZ 0313813A, and the European META-PHOR project, FOOD-CT-2006-036220.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.