|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: JK KS FJT GK. Performed the experiments: AME MWM RPM MVM. Analyzed the data: JK GK. Contributed reagents/materials/analysis tools: BW WR-M TI JA CG. Wrote the paper: JK KS FJT GK.
Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable amount of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these “unknown metabolites” is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype–metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.
Genome-wide association studies on metabolomics data have demonstrated that genetic variation in metabolic enzymes and transporters leads to concentration changes in the respective metabolite levels. The conventional goal of these studies is the detection of novel interactions between the genome and the metabolic system, providing valuable insights for both basic research as well as clinical applications. In this study, we borrow the metabolomics GWAS concept for a novel, entirely different purpose. Metabolite measurements frequently produce signals where a certain substance can be reliably detected in the sample, but it has not yet been elucidated which specific metabolite this signal actually represents. The concept is comparable to a fingerprint: each one is uniquely identifiable, but as long as it is not registered in a database one cannot tell to whom this fingerprint belongs. Obviously, this issue tremendously reduces the usability of a metabolomics analyses. The genetic associations of such an “unknown,” however, give us concrete evidence of the metabolic pathway this substance is most probably involved in. Moreover, we complement the approach with a specific measure of correlation between metabolites, providing further evidence of the metabolic processes of the unknown. For a number of cases, this even allows for a concrete identity prediction, which we then experimentally validate in the lab.
Recently, genome-wide association studies (GWAS) on metabolic quantitative traits have proven valuable tools to uncover the genetically determined metabolic individuality in the general population –. Interestingly, a great portion of the genetic loci that were found to significantly associate with levels of specific metabolites are within or in close proximity to metabolic enzymes or transporters with known disease or pharmaceutical relevance. Moreover, compared to GWAS with clinical endpoints the effect sizes of the genotypes are exceptionally high.
The number and type of the metabolic features that went into these GWAS was mainly defined by the metabolomics techniques used: Gieger et al.  and Illig et al.  used a targeted mass spectrometry (MS)-based approach giving access to the concentrations of 363 and 163 metabolites, respectively. Suhre et al.  and Nicholson et al.  applied untargeted nuclear magnetic resonance (NMR) based metabolomics techniques, yielding 59 metabolites that had been identified in the spectra prior to the GWAS and 579 manually selected peaks from the spectra, respectively. In Suhre et al. , 276 metabolites from an untargeted MS-based approach were analyzed.
While these previous GWAS focused on metabolic features with known identity, untargeted metabolomics approaches additionally provide quantifications of so-called “unknown metabolites”. An unknown metabolite is a small molecule that can reproducibly be detected and quantified in a metabolomics experiment, but whose chemical identity has not been elucidated yet. In an experiment using liquid chromatography (LC) coupled to MS, such an unknown would be defined by a specific retention time, one or multiple masses (e.g. from adducts), and a characteristic fragmentation pattern of the primary ion(s). An unknown observed by NMR spectroscopy would correspond to a pattern in the chemical shifts. Unknowns may constitute previously undocumented small molecules, such as rare xenobiotics or secondary products of metabolism, or they may represent molecules from established pathways which could not be assigned using current libraries of MS fragmentation patterns ,  or NMR reference spectra .
The impact of unknown metabolites for biomedical research has been shown in recent metabolomics-based discovery studies of novel biomarkers for diseases and various disease-causing conditions. This includes studies investigating altered metabolite levels in blood for insulin resistance , type 2 diabetes , and heart disorders . A considerable number of high-ranking hits reported in these biomarker studies represent unknown metabolites. As long as their chemical identities are not clarified the usability of unknown metabolites as functional biomarkers for further investigations and clinical applications is rather limited.
In mass-spectrometry-based metabolomics approaches, the assignment of chemical identity usually involves the interpretation and comparison of experiment-specific parameters, such as accurate masses, isotope distributions, fragmentation patterns, and chromatography retention times –. Various computer-based methods have been developed to automate this process. For example, Rasche and colleagues  elucidated structural information of unknown metabolites in a mass-spectrometry setup using a graph-theoretical approach. Their approach attempts to reconstruct the underlying fragmentation tree based on mass-spectra at varying collision energies. Other authors excluded false candidates for a given unknown by comparing observed and predicted chromatography retention times , , or by the automatic determination of sum formulas from isotope distributions . Furthermore, Gipson et al.  and Weber et al.  integrated public metabolic pathway information with correlating peak pairs in order to facilitate metabolite identification. However, these methods might not be applicable for high-throughput metabolomics datasets that have been produced in a fee-for-service manner, since the mass spectra as such might not be readily available.
Approaching the problem from a conceptually different perspective, we here present a novel functional metabolomics method to predict the identities of unknown metabolites using a systems biological framework. By combining high-throughput genotyping data, metabolomics data, and literature-derived metabolic pathway information, we generate testable hypotheses on the metabolite identities based solely on the obtained metabolite quantifications (Figure 1). No further experiment-specific data such as retention times, isotope patterns and fragmentation patterns are required for this analysis.
The concept of our approach is based on the following observations from our previous work on genome-wide association studies and Gaussian graphical modeling (GGM) with metabolomics: We showed that GWAS with metabolic traits can reveal functional relationships between genetic loci encoding metabolic enzymes and metabolite concentration levels in the blood –, . A genetic variant can alter, for instance, the expression levels of mRNAs or affect the properties of the respective enzymes through changes of the protein sequence (e.g. enzyme activity, substrate specificity). Moreover, we found that GGMs, which are based on partial correlation coefficients, can identify biochemically related metabolites from high-throughput metabolomics data alone , . These observations suggest that if an unknown compound displays a similar statistical association with a genetic locus in a GWAS or a known metabolite in a GGM, then this may provide specific information of where it is located in the metabolic network. Based on this information we can then derive testable hypotheses on the biochemical identity of the unknown metabolite. This annotation idea parallels classical concepts from functional genomics, where, for instance, co-expression between RNA transcripts is used to predict the function of poorly characterized genes , .
The manuscript is organized as follows: We first conduct a full genome-wide association study on 655,658 genotyped SNPs with concentrations of 225 unknown metabolites using fasting blood serum samples from a large German population cohort (n=1768) . We thereby extend our previous work on known metabolites  to a GWAS with hitherto unpublished unknown metabolic traits. We then compute a Gaussian graphical model including both known and unknown metabolites. In a third step, we integrate the results of the GWAS and GGM computations and combine them with metabolic pathway information from public databases to derive predictions for a total of 106 unknown metabolites. In order to validate the approach, we investigate six distinct cases, in which we derive specific identity predictions for a total of nine unknown metabolites, which we then confirm experimentally. Finally, we discuss the relevance of newly discovered genetic loci and unknown identity predictions in the context of existing disease biomarker discovery and pharmacogenomics studies.
All GWAS and GGM results, unknown metabolite classifications and pathway annotations are available as spreadsheets and in .graphml format in Dataset S1 or from our study website at http://cmb.helmholtz-muenchen.de/unknowns.
In the first step of our analysis, we conducted a GWAS with the concentrations of known and unknown metabolites, testing a total of 655,658 genotyped SNPs from the KORA cohort for association. Thus, in addition to the unknown metabolite data, we included the association data for known metabolites from our previous study  into the present analysis. Unknown metabolites are uniquely labeled in the format “X-12345”, which are identical throughout all published studies that use the Metabolon platform.
In total, we observe 34 distinct loci that display metabolite associations at a genome-wide significance level (Figure 2 and Dataset S1). Out of these 34 loci, 15 associate with at least one unknown compound. For 12 loci, an unknown compound constitutes the strongest association of all tested compounds. From the 213 unknown metabolites analyzed (see Methods for the determination of this metabolite subset), 28 show at least one genome-wide significant hit. These 28 associations at the 15 loci are presented in Table 1 along with all previously described GWAS hits to metabolic traits or other endpoints. Associating traits were determined from the GWAS catalog  for SNPs in LD (r2≥0.5) with the respective lead SNP. Seven of the 15 loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1, rs12413935) have not been described in GWAS with metabolic traits before and thus represent new genetic loci of metabolic individuality. Interestingly, genetic variants in strong LD with CYP2C18 have been reported to associate with warfarin maintenance dose .
In our previous GWAS with metabolic traits, we observed that metabolites associating with genetic variants in or near enzymes are likely to be functionally linked to these proteins. A SNP with detectable effects on the metabolome will, for instance, alter expression levels of mRNAs, or affect the properties of the respective enzymes (e.g. enzyme activity, substrate specificity) through modifications of the protein sequence. As an example of the latter case, the SNP rs4343 in the angiotensin converting enzyme (ACE) encoding gene was found to be associated with the activity of the enzyme  (See Table 1). To estimate the contribution of the first case, we compared our significant SNPs with expression quantitative trait loci (eQTLs) from published GWAS with expression levels. To this end, we queried the Genotype-Tissue Expression (GTEx) browser, an online eQTL database of the NIH GTEx roadmap project, which stores eQTL results for multiple human tissues (liver, lymphoblastoid, brain) . For seven SNPs in three distinct loci (PYROXD2, CYP3A5, SPPL3), we found significant cis-eQTLs (p-value<2.7×10−9, see Methods) in GTEx. All identified eQTLs with p-values below 10−5 are listed in Dataset S1.
Based on the observation that SNPs in or in the vicinity of enzymes are mostly associating with functionally related metabolites in case of the knowns, we used the GWAS data to derive hypotheses on the potential identity of the respective unknowns. For instance, the SNP rs296391 in close proximity to the SULT2A1 gene (sulfotransferase family, cytosolic, 2A, dehydroepiandrosterone DHEA-preferring) strongly associates with the concentrations of the unknown metabolites X-11440 and X-11244 (p=1.7×10−43 and p=2.1×10−26, respectively). The enzyme encoded by SULT2A1, a bile salt sulfotransferase, converts steroids and bile acids into water-soluble sulfate conjugates for excretion . Thus, we may speculate that X-11440 and X-11244 are biochemically related to steroids, bile acids, or water-soluble sulfate conjugates. Additional insights can be gained from genetic associations that involve both known and unknown metabolites. For instance, X-12510, X-11787, X-12093 and N-acetylornithine strongly associate with genetic variation at the NAT8 locus. NAT8 encodes the protein N-acetyltransferase 8. In this case, we may speculate that the unknowns represent similar substrates or products of the N-acetylation processes linked to this enzyme. Finally, we can link the results obtained here with results from other GWAS on metabolic traits. For example, the unknown metabolite X-13431 associates with a genetic variant in the ACADL (acyl-CoA dehydrogenase, long-chain) gene. This locus does not associate with any other metabolite in the present study, but was previously reported to associate with the medium-chain length carnitines C9 and C10:1 , . Proteins from the ACAD family catalyze rate-limiting reactions in the β-oxidation pathway which generally associate with carnitines. This observation suggests that X-13431 may be a member of this medium-chain length carnitine family. These examples demonstrate that concrete information on the biochemical identity of unknown metabolites can be derived from our experimental dataset by using the GWAS approach.
In the second step of our analysis we focused solely on intrinsic relations between the measured metabolites and, in particular, on associations between known and unknown compounds. To this end, we applied Gaussian graphical models (GGMs), which we have previously shown to be able to reconstruct pathways involving directly related metabolites from cross-sectional blood serum metabolomics data , . GGMs are based on partial correlation coefficients, that is, correlations between pairs of metabolites corrected for the effects of all remaining metabolites. Each known metabolite is annotated with a “super-pathway” corresponding to its general metabolic class, and a “sub-pathway” representing more specific metabolic pathways (see Dataset S1). In order to obtain a dataset that is independent of our genetic analysis, and to avoid circular arguments, co-variations in metabolite concentrations that are due to association with genetic variants (SNPs) were specifically removed from the data (see GGM methods for further details). A partial correlation was included in the model if it was significantly different from zero with α=0.05 after Bonferroni correction, yielding a corrected significance level of =7.9×10−7 and an absolute partial correlation cutoff of ζ=0.178. The resulting GGM consists of a total of 399 out of 62,835 theoretically possible edges (0.64% connectivity, Figure 3A). In line with our previous observations , metabolites tend to be strongly connected within their respective metabolic class, while links between different classes are rare (see Text S1). Inspecting the GGM in detail, we observe that the unknowns are tightly integrated within the network and connected to known compounds of various metabolic classes. This is reflected both in the overall network (Figure 3A, Text S1) and in the top list of high-scoring GGM edges (Table 2), where 18 of the 30 strongest partial correlations comprise at least one unknown metabolite. The highest partial correlation in the dataset actually involves a known-unknown metabolite pair, namely 3-indoxylsulfate and the unknown metabolite X-12405 (ζ=0.840). For pairs of known metabolites, we consistently observe associations of biochemically related metabolites from various metabolic pathways, such as the metabolites inosine and guanosine (ζ=0.798), which are involved in nucleotide metabolism, or androsterone sulfate and epiandrosterone sulfate (ζ=0.755), which represent related steroid hormone metabolites. Other pathways with related metabolite pairs include amino acid metabolism, lipid metabolism, bile acid metabolism, and xanthine metabolism. Following our line of reasoning, correlating pairs of a known and an unknown metabolite then directly point to specific pathways of cellular metabolism on which the unknown metabolite may lie. The investigation of the sub-network structure around the unknown compounds provides additional biochemical context for that compound.
We selected four high-scoring sub-networks in the GGM to show that this concept is indeed applicable to real data. The first two of these sub-networks consist of a series of intermediate compounds from purine metabolism, including guanosine, inosine, xanthine derivatives and urate (Figure 3B and 3C). In these cases, one can actually follow the addition and removal of chemical groups by following the edges in the GGM network: Most edges in these sub-networks correspond to the change of either a single methyl group at the purine double-ring structure or to the removal of a ribose residue in the reaction from nucleosides to xanthine variants. While the compounds in both sub-networks appear structurally similar, the distinction into two groups by the GGM is indeed biochemically sound. The metabolites in Figure 3B correspond to endogenous substances in the nucleoside pathway, whereas the molecules in Figure 3C relate to signals from xenobiotic metabolism of drugs and caffeine. Here, the unknown metabolites X-11422 and X-10810, as well as X-14473 and X-14374 are prominently placed in the networks, making them direct targets for closer inspection with respect to endogenous xanthines and xenobiotics, respectively.
The third sub-network comprises three androsterone sulfate variants, which belong to the class of steroid hormones (Figure 3D). We observe direct GGM links between the unknowns X-11450, X-11244 and X-11443 with both dehydroepiandrosterone sulfate (DHEAS) and epiandrosterone sulfate, suggesting androsterone derivatives as likely candidates for these three metabolites. The fourth sub-network involves different stereoisomers of bilirubin, which is the degradation product of the oxygen transporter hemoglobin  (Figure S1). In this sub-network, we observe high partial correlations between the bilirubin variants and a series of unknown metabolites (X-11441, X-11530, X-11442, X-11793, X-11809, X-14056, and X-14057). The seven unknown compounds in this GGM sub-network are thus likely to be involved in hemoglobin degradation processes. Taken together, the examples confirm that further information on the biochemical identity of unknown metabolites can be extracted from GGM networks.
The next step in our analysis was the integration of the GGM and GWAS approaches with general pathway information from external databases, in order to generate concrete predictions for the unknowns' metabolic pathway memberships. As a feasibility test, we first asked whether the local neighborhood of a metabolite in the GGM can be used to correctly predict its metabolic class. Using a majority-voting based classifier and subsequent permutation testing, we detected significant classification abilities (mean sensitivity 0.674, mean specificity 0.84, macro-averaged F1 score 0.72) far beyond random (p<10−8,). Detailed results can be found in Text S2. Note that we performed this approach only to demonstrate the systematic possibility to derive functional information from the GGM. The actual classification of the unknowns in the following will not be based on majority voting, but rather on the collection of all available functional information from GGM neighbors and GWAS hits.
We combined functional annotations for both GGM neighbors and GWAS hits for each unknown in order to derive specific pathway classifications. For unknowns that did not have a known metabolite neighbor in the GGM, we also investigated the 2- and 3-neighborhoods. Since these hits certainly represent weaker evidence than a direct GGM neighbor, we distinguish between ‘GGM hit’ and ‘direct GGM hit’ in the following. Functional annotations were obtained from three sources: (1) The sub-pathway assignment provided for each known metabolite in the GGM neighborhood, (2) the GO functional terms for the associated gene of all genome-wide significant GWAS hits, and (3) the KEGG pathways on which the associated genes lie. To the best of our knowledge, there is presently no consistent mapping between annotations from the different data sources available for both metabolites and genes, so we here had to perform the only non-automatic step in the analysis: By manual interpretation of different functional classes (Figure 4A), we derive a single consensus pathway annotation for a total of 106 of the unknown metabolites (Figure 4B). For 98 unknowns, we obtained annotations from the GGM network, with 74 of these hits representing direct GGM hits. From the 28 genetic hits introduced above, 27 were in a genetic region with gene annotation. Overlaying the direct edge GGM set and the GWAS set, we obtained 16 unknowns with both biochemical and genetic evidence (Figure 4C). A list of all functional evidence along with the respective predictions can be found in Table S1.
In the following, we selected several unknowns that were forwarded to detailed analysis and experimental validation. Five cases were obtained from the set of 16 high-confidence predictions in the previous section, since the combined evidence from GWAS and GGMs provides rich functional annotations that allow to derive possible compound candidates. Moreover, in order to demonstrate the power of GGMs in the absence of genetic associations, we selected one further case (HETE) where publically available pathway information was systematically exploited. Experimental validations were performed by running pure candidate compounds on the LC-MS/MS platform. For cases where no pure compound was available, we determined exact molecule masses and revisited the retention times and fragmentation spectra.
We investigated six metabolic scenarios in-depth and attempted experimental confirmation of the respective predictions (Table 3). In the following, we discuss three example cases, termed DIPEPTIDE, STEROID, and HETE (Figure 5). Three further examples, named CARNITINE, BILIRUBIN, and ASCORBATE, are presented as Text S3. In the discussion of these scenarios we now use all available evidence, the metabolite correlations, genetic associations, biochemical data, and in addition the molecular masses reported with the known and unknown compounds (which do not represent exact masses at this point). Note that the presented scenarios represent the only cases where a detailed investigation has been attempted. Moreover, the candidate compounds mentioned in the following paragraphs and the supplementary material are the only compounds that have been experimentally tested (there are no negative results not reported in this text).
Our first scenario, DIPEPTIDE, presents the prediction and successful validation of three unknown metabolites involved in short-peptide metabolism (Figure 5, left). In the GGM, we observe X-14205, X-14208 and X-14478 in close proximity to various dipeptides, to glutathione derivatives, and to two longer fibrinogen-related peptides. The primary pieces of genetic evidence for this case are the GGT1 locus, which shows a strong association to S-gluthathionyl-L-cysteine, and the ACE locus, which connects to aspartyl-phenylalanine, X-14205, and X-14208. GGT1 encodes for the protein γ-glutamyl transpeptidase, which transfers glutamyl-residues from glutathione in order to generate short-chain peptides . This fits well into the network picture, since GGT1 is connected to the glutathione derivative, which in turn shares a GGM edge with γ-glutamyl-glutamine. ACE, on the other hand, encodes the angiotensin I converting enzyme, a peptidase that cleaves dipeptide fragments from angiotensin precursors and other functional oligopeptides. Since the biochemical and genetic evidence pointed us to short peptides, and dipeptides in particular, we enumerated all possible 400 (=20×20) combinatorial variants of dipeptides and checked the mass against the masses of the three unknowns under investigation. As an example, we shortened the list of candidates for X-14208 from 2,732 (ChemSpider search) to only 8 molecules, respectively.
For experimental validation, we first checked the plausibility of the candidates with respect to the fragmentation spectra and determined the exact masses. The accurate mass determined for X-14208 is 252.11172±0.001 Da, supporting the chemical formula C12H16N2O4. While the formula still matches more than 1,200 molecular structures, the prediction of this unknown as a dipeptide leaves only two candidate molecules, namely phenylalanylserine (Phe-Ser) and serylphenylalanine (Ser-Phe). Both variants were obtained from a commercial source and run on the LC-MS/MS platform. The retention index  and the fragmentation spectrum received for Phe-Ser matched the index and spectrum of X-14208, whereas Ser-Phe produced a clearly different spectrum (Figure 6). Thus, the identity of X-14208 was experimentally confirmed as the dipeptide phenylalanylserine. Importantly, using our integrated approach, we were able to identify X-14208 by only testing two candidate molecules. The other two unknowns, X-14205 and X-14478, were identified through similar experiments as α-glutamyltyrosine (α-Glu-Tyr) and phenylalanylphenylalanine (Phe-Phe), respectively.
In the second scenario, STEROID, we investigated an unknown metabolite (X-11244) for which both GGM and GWAS data strongly indicate an identity related to steroid-hormone compounds: X-11244 is tightly linked via GGM edges to dehydroepiandrosterone sulfate and two other unknowns, which in turn connect to epiandrosterone sulfate and androsterone sulfate (Figure 5, middle). Furthermore, X-11244 displays a highly significant genetic association (p=2.1×10−26) with rs296391, which lies in strong LD in the SULT2A1 gene locus. SULT2A1 encodes for a member of the sulfotransferase family 2A, dehydroepiandrosterone-preferring, further strengthening the metabolic context. Based on the GGM and GWAS results, we hypothesized that X-11244 is a steroid sulfate related to androstane.
Experimentally, the primary loss of a fragment with a nominal mass of 98 and the presence of an ion at 97 m/z observable in the fragmentation spectrum of X-11244 indicate the presence of at least one sulfate group in this unknown (Text S3). The exact mass determined for X-11244 supports the chemical formula C19H30O8S2. Querying ChemSpider for this chemical formula yields only four results, one of which corresponds to an androstene disulfate variant (ChemSpider ID 21403154). Analysis of several disulfated androstenes demonstrated similar retention times and fragmentation spectra. Among the tested variants, 4-androsten-3β,17β-disulfate showed the best match. Given that other isomers are also possible, which cannot necessarily be chromatographically resolved, we annotated X-11244 more generically as ‘androstene disulfate’.
In the third scenario, HETE, we made explicit use of known biochemical interaction derived from three publically available pathway databases. We searched for cases, where an unknown shows GGM connections to known compounds for which a direct pathway interaction with a metabolite having the same mass as the unknown exists. Such cases are rare, but if present provide a strong argument for an unknown's identity. We selected the HETE scenario as an example for experimental validation in such a situation. The unknown metabolite X-12441 does not show any genome-wide significant SNP hits and only a single GGM neighbor: cis-5,8,11,14-eicosatetraenoic acid (arachidonate, Figure 5, right). Arachidonate constitutes pathway connections to several other lipid-related metabolites, including a variety of hydroxy-arachidonate variants (HETEs). These variants have the chemical formula C20H32O3 with a molecular weight of 320.2351 Da, matching the mass of the unknown. We thus hypothesized that X-12441 represents a specific HETE species.
Experimentally, the determination of the exact mass of the unknown further supported our hypothesis, as the accurate mass determined for X-12441 matches the chemical composition of HETE to a precision of ±0.002 Da. A number of HETE isoforms were experimentally tested, including the 5, 8, 9, 11, 12 and 15 isoforms. All isoforms produced unique fragmentation spectra that permitted the precise identification of the unknown X-12441 as the 12-HETE isoform (see Text S3 for fragmentation spectra).
Three further scenarios, involving acylcarnitines in connection to mitochondrial β-oxidation enzymes (CARNITINE), bilirubin metabolism (BILIRUBIN), and a connection between ascorbate metabolism and catechol-O-methyltransferase (ASCORBATE), are described in Text S3. Briefly, in the CARNITINE scenario, X-11421 and X-13431 were experimentally confirmed as acylcarnitines containing 10 and 9 carbon atoms, respectively (namely cis-4-decenoyl-carnitine and nonanoyl carnitine). For the BILIRUBIN scenario, the unknown metabolite X-11793 was identified as an oxidized bilirubin variant. Lastly, in the ASCORBATE scenario, X-11593 was predicted and experimentally confirmed as O-methylascorbate. Taken together, we here predicted and experimentally validated the biochemical identity of nine unknown metabolites in six different biochemical scenarios.
We developed and validated a novel integrative approach for the biochemical characterization of “unknown metabolites” from high-throughput metabolomics and genotyping datasets. Our method allows for the functional annotation of previously unidentified metabolites and, as a consequence, enhances the interpretability of metabolomics data in genome-wide association studies and biomarker discovery. For the first time, we systematically evaluated genetic associations of unknown metabolites, thereby discovering seven new loci of metabolic individuality. By classifying a series of unknown metabolites, we gained new insights into the functional interplay between genetic variation and the metabolome both for previously reported and new loci. Furthermore, several of the unknown compounds that we identified as well as their newly associated loci were independently reported in disease-related studies. In the following, we discuss three genetic loci and their associated phenotypes.
The first example is a recent biomarker study, where Milburn et al.  reported an association of X-11593 with hepatic detoxification. In our GWAS, we find a strong association of X-11593 with the COMT locus, which encodes the catechol-O-methyltransferase enzyme. COMT is responsible for the inactivation of catecholamines such as L-dopa and various neuroactive drugs by O-methylation . Following our identification approach, we experimentally confirmed the identity of X-11593 as O-methylascorbate. Notably, O-methylascorbate is a known product of ascorbate (vitamin C) O-methylation by COMT , . Thus, our observations establish a link between O-methylascorbate blood levels, common genetic variation in the COMT locus and COMT-mediated liver detoxification processes.
The second example relates to the ACE gene locus, which is a known risk locus for cardiovascular disease, hypertension and kidney failure. The protein encoded by the ACE locus, angiotensin-converting enzyme, is an exopeptidase which cleaves dipeptides from vasoactive oligopeptides, and plays a central role in the blood pressure-controlling renin-angiotensin system . Moreover, the ACE protein is a target for various pharmaceuticals (ACE inhibitors), especially in the treatment of hypertension . In our study, we identified three unknowns as dipeptides (X-14205, X-14208 and X-14478), two of which also associated with the ACE locus. These dipeptides could thus represent novel, interesting biomarkers for the activity of ACE. Moreover, Steffens et al.  reported a connection between heart failure and X-11805, which is in close proximity to angiontensin-related peptides in the GGM. This connection might be revisited after a successful identification of X-11805 in a future study.
The third example is an explorative study to detect biomarkers for insulin sensitivity. Gall et al.  reported several known metabolites (most prominently α-hydroxybutyrate) as biomarkers for insulin resistance. They also reported a series of unknown metabolites among their top hits. In the present study, we investigated three of these unknowns: X-11793 associates with UGT1A (UDP glucuronosyltransferase 1) and represents a bilirubin-related substance. Moreover, we experimentally validated X-11421 and X-13431, which display a strong association with ACADM (acyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain), as acylcarnitines containing 10 and 9 carbon atoms, respectively. The identification of these latter two unknown metabolites as medium-chain length acylcarnitines is coherent with reports by Adams et al. . The authors found elevated blood plasma acylcarnitine levels in women with type 2 diabetes. Functionally, they attributed this finding to incomplete β-oxidation. Thus, our identification of X-11421 and X-13431 now suggests incomplete β-oxidation as an explanation for the associations found by Gall et al. and implies that acylcarnitines containing 10 and 9 carbon atoms are potential biomarkers for insulin resistance.
In summary, we integrated high-throughput metabolomics and genotyping data from a large population cohort for elucidating the biochemical identities of unknown metabolites. To this end, we applied metabolomics genome-wide association studies and Gaussian graphical modeling in order to link these unknown metabolites with known metabolic classes and biological processes. For six specific scenarios, we went from systematic hypothesis generation over detailed investigation and identity prediction to direct experimental confirmation. Similar validations may now be undertaken for the remaining predictions that we report in Table S1. Finally, we demonstrated the benefit of our method by discussing several of these newly identified metabolites in the context of existing biomarker discovery studies on liver detoxification, hypertension and insulin resistance.
It is to be noted that our method does not specifically require genotyping data. Even metabolomics measurements alone, analyzed through the GGMs, may provide sufficient information for the classification and even precise identity prediction. The unknowns with GGM evidence but without GWAS hits in Figure 4 as well as the HETE scenario represent examples for this approach.
One limitation of our approach is the requirement for associations with functionally described loci or known metabolites. Certain metabolite groups might thus systematically not be identifiable. For instance, if the identity of a whole class of biochemically related molecules is unknown (which might be due to experimental reasons), then the GGM associations between those compounds will not aid in identity elucidation. The 118 unknown compounds for which we could not derive any classification might represent such cases. Thus, our functionally oriented method should be regarded as a complementary extension to the existing identity determination methods.
Accordingly, our approach can be extended in several directions. It can be combined with method-specific, automated techniques that further exclude sets of metabolites. Previously mentioned methods relying on mass-spectra  or chromatographic properties  are suitable candidates here. Moreover, the method can be directly transferred to other types of metabolomics datasets not specifically originating from MS experiments, such as NMR-based metabolomics.
Beyond the application to metabolite identification, our study demonstrates the general potential of functional metabolomics in the context of genome-wide association studies. The comprehensive metabolic picture provided by GGMs in combination with GWAS allows for the detailed analysis of metabolic functions, chemical classes, enzyme-metabolite relationships and metabolic pathways.
We used data from n=1768 fasting serum samples used in a previously published genome-wide association study on a German population cohort. Details of the sample acquisition and experimental procedures can be found in . Briefly, metabolic profiling was done using ultrahigh-performance liquid-phase chromatography and gas-chromatography separation, coupled with tandem mass spectrometry. The dataset contains a total of 292 known compounds and, in addition to the GWAS study in , 225 unknown compounds. Metabolite concentrations were log-transformed since a test of normality showed that in most cases the log-transformed concentrations were closer to a normal distribution than the untransformed values . Genotyping was carried out using the Affymetrix GeneChip array 6.0. For our analyses, we only considered autosomal SNPs passing the following criteria: call rate >95%, Hardy-Weinberg-Equilibrium p-value p(HWE)>10−6, minor allele frequency MAF>1%. In total, 655,658 SNPs were left after filtering.
In order to avoid spurious false positive associations due to small sample sizes, only metabolic traits with at least 300 non-missing values were included and data-points of metabolic traits that lay more than 3 standard deviations off the mean were excluded by setting them to ‘missing’ in the analysis (leaving 273 known and 213 unknown metabolites). Genotypes are represented by 0, 1, and 2 for major allele homozygous, heterozygous, and minor allele homozygous, individuals respectively. We employed a linear model to test for associations between a SNP and a metabolite assuming an additive mode of inheritance. Statistical tests were carried out using the PLINK software (version 1.06)  with age and gender as covariates. Based on a conservative Bonferroni correction, associations with p-values<1.6×10−10 meet genome-wide significance, corresponding to a significance level of α=0.05. SNP-to-gene assignments were derived via linkage disequilibrium (LD) from HAPMAP . A SNP was associated with a gene whenever there was at least one other SNP lying in the transcribed region of this gene (that is from 5′UTR to 3′UTR) that displays an r2≥0.8 with the query SNP. A detailed description of the GWAS procedure can be found in .
Lookups of previously known associations between phenotypes and genetic variants were performed using the GWAS catalog . We list a phenotype with one of our GWAS hits, if the phenotype was reported with at least one SNP that displays an LD r2≥0.5 with the respective “lead SNP”. Lookups of eQTLs were performed for all significant SNPs (474, see Dataset S1) using the GTEx database . We applied a p-value cutoff of 2.7×10−9, corresponding to a significance level of 0.05 and correction for 474×40,000 tests (the number of SNPs times number of transcripts, conservative estimate). Detailed results up to a p-value of 10−5 can be found in Dataset S1.
For the GGM calculation, we require a full data matrix without missing values. From the original data matrix containing n=1768 samples and 517 metabolites (thereof 292 knowns and 225 unknowns), we first excluded metabolites with more than 20% missing values (column direction), and then samples with more than 10% missing values (row direction). The filtered data matrix still contained n=1764 samples with 355 metabolites (217 knowns and 138 unknowns). Remaining missing values were imputed with the ‘mice’ R package . Note that the numbers of metabolites used in the GWAS and in the GGM analysis differ due to specific constraints for the treatment of missing values in the two methods.
Gaussian graphical models are induced by full-order partial correlation coefficients, i.e. pairwise correlations corrected against all remaining (n-2) variables. GGMs are based on linear regressions with multiple predictor variables. When regressing two random variables X and Y on the remaining variables in the data set, the partial correlation coefficient between X and Y is given by the Pearson correlation of the residuals from both regressions. Since our dataset contains more samples than variables, full-order partial correlations can be conveniently calculated by a matrix inversion operation. A significance cutoff of α=0.05 with Bonferroni correction was applied. A detailed description of the GGM calculation procedure can be found in .
Age, gender and SNP effects were removed by adding the respective variables and SNPs states to the data matrix. For each pair of variables under investigation, Gaussian graphical models remove the effects of all remaining variables on this correlation (due to the above-mentioned linear regression approach). That is, adding a variable to the data matrix will automatically result in the removal of confounding effects of this variable on the correlations of all other variables. Note that age, gender and SNPs were not investigated as an actual node in the network but merely used for the correction procedure. For the later analysis steps, we then only considered metabolite-metabolite edges in the network. SNP states were coded as numerical values of 0, 1 and 2 (see previous section), such that the linear regressions that underlie the GGM correspond to an additive genetic model (cf. ). Gender represents a “dummy variable”  in the linear regression model which only takes values of 1 (male) and 0 (female).
Metabolic reactions were imported from three independent human metabolic reconstruction projects: (1) H. sapiens Recon 1 from the BiGG databases , (2) the Edinburgh Human Metabolic Network (EHMN) reconstruction  and (3) the KEGG PATHWAY database  as of January 2012. We attempted to create a highly accurate mapping between the different metabolite identifiers of the respective databases, in order to ensure the identity of each compound in our list. Entries referring to whole groups of metabolites, such as “phospholipid”, “fatty acid residue” or “proton acceptor” were excluded from our study. Furthermore, we did not consider metabolic cofactors such as “ATP”, “CO2”, and “SO4” etc. in our analysis, since such metabolites unspecifically participate in a plethora of metabolic reactions. For each enzyme catalyzing one or more reactions in our pathway model, we retrieved functional annotations from two independent sources: (i) GO-Terms from the Gene Ontology  and (ii) enzyme pathway annotations from the KEGG PATHWAY database .
All imported metabolic pathways along with metabolite database identifiers, excluded compounds and pathway annotations can be found in Dataset S1.
ZIP archive containing Excel sheets and .graphml files for the GGM and GWAS results, as well as detailed pathway annotations.
GGM sub-network with bilirubin variants.
Systematic classifications for 106 unknown metabolites.
Detailed GGM modularity analysis results.
Assessment of the majority voting classification approach.
Supporting experimental data and description of the unknown identification scenarios CARNITINE, BILIRUBIN, and ASCORBATE.