|Home | About | Journals | Submit | Contact Us | Français|
We use whole genome sequence data and rare variant analysis methods to investigate a subset of the human serum metabolome, including 16 carnitine-related metabolites that are important components of mammalian energy metabolism. Medium pass sequence data consisting of 12,820,347 rare variants and serum metabolomics data were available on 1,456 individuals. By applying a penalization method, we identified two genes FGF8 and MDGA2 with significant effects on lysine and cis-4-decenoylcarnitine, respectively using Delta-AIC and LRT statistics. Single variant analyses in these regions did not identify a single low frequency variant (MAC>3) responsible for the underlying signal. The results demonstrate the utility of whole genome sequence and innovative analyses for identifying candidate regions influencing complex phenotypes.
The metabolome is a collection of small molecules resulting from a multitude of cellular and physiologic processes (Pearson, 2007). Measurement and analysis of the metabolome in well characterized sample sets can provide insight into the underlying mechanisms of genomic and environmental actions and interactions on metabolism. In addition, investigations of the metabolome are likely to prove more productive for novel gene and locus discovery because the metabolome is functionally closer to the DNA level, where effect sizes will be larger than similar analyses of distant risk factor levels or disease endpoints (Yu et al., 2014). Unlike proteomics and genomics, which have enjoyed considerable concentrated efforts, the metabolome is relatively under-studied, partly because of the difficulty of measuring and annotating such a large number of compounds, and the identity of many of the analytes in the metabolome remain unknown. Indeed, genetic studies of the metabolome can lead to insight into the identity of these unknown compounds (Zheng et al., 2013). Multiple genomic studies have been done to identify common variants influencing subsets of the metabolome in plasma (Yu et al., 2014; Rhee et al. 2013) and urine (Suhre et al., 2011), and such studies have been valuable tools in agriculture research (Wen et al., 2014). GWAS studies, however, capture the influence of common variants and exome sequencing is limited to the small protein-encoding portion of the genome (Demirkan et al., 2015).
Whole genome sequencing in large sample sizes is now a reality (Morrison et al., 2013). Analysis of whole genome sequence offers the advantage of ascertaining genome variation outside of the protein-encoding region and the ability to investigate the role of rare variants on phenotypic variation. Studies using whole genome sequence to analyze phenotypes also encounter a number of challenges that must be overcome. First, there are a very large number of rare variants and many of these variants are limited to a single individual. Therefore, usual single site methods comparing mean values or frequencies do not have sufficient statistical power and are often not appropriate or feasible (Lee et al., 2014, Yazdani et al., 2015). Second, outside of genes and a few well-characterized regulatory elements, the role of much of the genome in normal development, metabolism and physiology remains unknown. Despite impressive advances from studies such as ENCODE (Consortium et al., 2011), predicting the consequences of sequence variation outside of a protein encoding gene is difficult.
We use whole genome sequence data and rare variant analysis methods introduced by Yazdani et al. (2015) called CCRS to investigate a model subset of the human serum metabolome, specifically the carnitine-related metabolites. Carnitine is synthesized in the liver and kidney and is required for the transport of fatty acids within mitochondria for energy production (Steiber et al. 2004). A major source of carnitine in the body, however, is dietary, with red meat having the highest levels (Flanagan et al. 2010). There are number of genetic forms of carnitine deficiency that influence the cycling of carnitine or acylcarnitine from the cytosol to the mitochondrial matrix (Olpin, 2004). Primary carnitine deficiency is a rare autosomal recessive disorder of fatty acid transport caused by mutations in SLC22A5 (Li et al. 2010). However, the complete picture of the genetic architecture of carnitine is unclear. In this study, we use whole genome sequence data containing 12,820,347 rare variants (i.e. minor allele frequency, MAF, less than 5%) to analyze 16 metabolites involved in carnitine synthesis and metabolism measured on 1,456 European-Americans from the Atherosclerosis Risk in Communities (ARIC) study. For this analysis, we implemented the CCRS rare variant analysis and used sliding windows across the genome in order to identify two loci spanning the FGF8 and MDGA2 genes associated with lysine and cis-4-decenoylcarnitine levels, respectively.
Metabolomic and genomic data were available on a subset of the Atherosclerosis Risk in Communities (ARIC) study, a biracial longitudinal cohort study of 15,792 middle-aged individuals who were randomly sampled from four US communities and have been measured for multiple risk factor phenotypes related to health and chronic disease. A detailed description of the ARIC study design and methods has been published elsewhere (The ARIC investigator, 1989). The data presented here includes 1,456 European-American individuals having metabolomic and whole genome sequence data. Written informed consent was obtained from each study participant, including that for broad data sharing. The dbGAP accession number for ARIC study data, including genomic data, is phs000668.v1.p1.
Metabolic profiling was carried out on fasting serum samples from the baseline examination and stored at −80oC. Metabolites were measured using untargeted gas chromatography-mass spectrometry and liquid chromatography-mass spectrometry (GC-MS and LC-MS)-based quantification protocols. Details of these procedures have been previously published (Zheng et al., 2014). The analyses presented here focus on 16 metabolites in the pathway of carnitine synthesis and metabolism, which were near normally distributed or could be transformed to such and had a missing data rate less than 25%.
Missing data were imputed using the k nearest neighbor algorithm of (Verboven et al., 2007). This algorithm starts from a complete subset of the data set Xc and sequentially estimates the missing values for an incomplete observation, x*, by minimizing the determinant of the covariance of the augmented data matrix X* = [Xc; x’]. Then the observation x* was added to the complete data matrix and the algorithm continues with the next observation.
Details of the sequencing methods are in the supplementary materials of Morrison (2013). Briefly, automated Illumina PE libraries were barcoded with Illumina’s multiplex adapters and pooled for sequencing in sets of three samples to generate an average of 6-fold sequence coverage per sample. Methods for WGS sequencing followed standard Illumina PairEnd library protocols with minor modifications. Variants were called using SNPTools (http://www.hgsc.bcm.tmc.edu/cascade-tech-software_snp_tools-ti.hgsc), which consists of variant discovery, genotype likelihood estimation, and genotype inference via imputation. Quality control steps included: low coverage, low variant quality, high degree of relatedness, and evidence of hidden population substructure excluding variants and individuals. To identify population structure, we calculated principal components, PCs, over individuals using common variants and selected first ten PCs. We then adjusted metabolites using those PCs in addition to age and sex as covariates by fitting a linear regression and using its residuals as outcome for the main analysis.
To address the multitude of analytic challenges confronted when analyzing rare variants across the genome, we applied a new statistical approach based on penalization methods by Yazdani et al. (2015). This approach follows a two-stage analysis. In the first stage, it applies principal component analysis with convex penalization (Witten et al., 2009) to avoid over fitting due to linkage disequilibrium (LD) among variants (Talluri et al., 2013; Feng et al., 2014) in a multiple regression model. In the second stage, it uses concave penalization in a multiple regression model to obtain sparse estimates of the coefficient for the principal components, in the sense that some of the coefficients may be explicitly shrunk to zero (Frank et al., 1993; Fu, 1998). Although this method is a multi-stage analysis, it does not increase the false discovery rate since the first stage is an unsupervised analysis approach. Instead, it avoids missing important region through multiple hypotheses testing steps by incorporate local LD structure, and maintains power while keeping the false discovery rate low by considering sparsity in the data. This method is called CCRS due to its convex and concave penalization for rare variants analysis.
We implemented the CCRS method in sliding windows across the genome. Each window included 50 variants with a step size of 25 variants to the next window. We selected the most promising window associated with the metabolites using Δ-AIC (Burnham and Anderson, 2002) and Likelihood Ratio test (LRT). The Δ-AIC is a model selection criterion that comparers the Akaike’s (1973) information criterion (AIC) of all the models with model with min AIC, i.e. Δ-AICi = AICi- min (AIC). Δ-AICi < 2 suggests substantial evidence for the ith model, values between 3 and 7 indicate that the model has considerably less support, whereas an AICi>10 indicates that the model is very unlikely (Burnham and Anderson, 2002).
We also calculated a likelihood ratio test (LRT) statistic to test the null hypothesis that considers no genotype effect within a window. The LRT approximately follows a chi-square distribution with degrees of freedom equal to a complex function of the design matrix of the model and the penalization (Hastie et al., 2005). A permutation test was carried out to estimate the empirical distribution of p-values under the null hypothesis and to calculate a quantile threshold defining statistical significance for each trait (Box 1).
There were 16 analyses in the carnitine synthesis and metabolism pathways that were represented in this metabolomics dataset. Their names, averages and other descriptive statistics for the study sample are shown in Table I and Table II. By design, the sample consisted of only European-Americans. Consistent with enrollment and participation, there were more females than males in the study sample, and the average BMI was 27.3. In general, the individuals included in this metabolomics substudy were similar to participants in the entire ARIC study.
The QQ plots for all 16 metabolites are shown in the online Supplement. Two genomic regions in the analysis of lysine and cis-4-decenoylcarnitine yielded minimum AIC such that Δ-AICi for all other models are greater than 4.2 and 2.1, respectively, in addition to having very small p-values relative to other calculated p-values using LRT. These regions advanced to the permutation algorithm described in Box 1 to obtain a significance threshold for these two metabolites. By permuting each metabolite n=100,000 times and calculating the LRT statistic and corresponding p-value in m=1000 randomly selected windows, we defined the significance threshold to be 10−8 and 10−6 at the level of type I error equal to α=0.001 for lysine and cis-4-decenoylcarnitine, respectively. Using these two thresholds and Δ-AICi, we identified two significant regions listed in Table III. Among the sites having a MAF<0.05 included in this analysis, the median minor allele frequency within the significant windows was 0.0034 and 0.0014 for lysine and cis-4-decenoylcarnitine, respectively. To observe the influence of single variants in significant regions on the traits of interest, we used simple regression to estimate the effects of each variant with minor allele count, MAC, greater than 3. The single variants having a p-value < 0.05 are shown in Table IV.
Figure 1 and Figure 2 show regional plots of 201 windows centered on the significant regions for lysine and cis-4-decenoylcarnitin, respectively. The differences between the very small p-value in the selected regions and other surrounding regions can be readily observed. There appears to be a background null distribution of p-values > 10−4, and the two regions of interest rise notably above this background. To help fine map the possible signal in the region significantly associated with lysine included two genes, we serially excluded variants in FGF8 or NPM3 and computed the p-values using CCRS for each subset.
After excluding 11 variants in FGF8, the signal for the metabolite lysine is attenuated; the p-value of the model is increased from 1.75e-10 to 2.63e-06. In contrast, the signal was not changed appreciably after excluding 7 variants in the NPM3; the p-value of the model is increased to 4.01e-09. Figure 3 shows a regional plot of individual variant effects with MAC>3 based on a single-based regression approach. The figure shows multiple variants with small P-values within FGF8 gene. The SNP on the right hand side of figure 3 is outside of the boundaries of the NPM3 gene.
In addition to the CCRS method, we applied other common methods for rare variants analysis such as CAST (Morgenthaler et al., 2007) and SKAT-O (Lee et al., 2012). The results of these analyses, identified one significant region using SKAT-O and this region is on chromosome 14 and is the same region associated with cis-4-decenoylcarnitine in our analysis using CCRS.
Whole genome sequence analysis of complex traits is in its infancy, and to date there has been only one other foray into the area (Morrison et al., 2013). Success depends on the bringing together of multiple complementary lines of investigation. In this study, we combined whole genome sequencing in 1,456 individuals, detailed metabolomics measures of 16 analytes involved in carnitine synthesis and metabolism, and innovative analytic methods that take into account the very large number of rare genomic variants while maintaining statistical power and controlling the false discovery rate in order to identify that the FGF8 and MDGA2 genes were associated with inter-individuals variation in lysine and cis-4-decenoylcarnitine, respectively. In the absence of replication or detailed mechanistic studies, these findings should be viewed as preliminary, and FGF8 and MDGA2 should be viewed as candidate genes influencing lysine and cis-4-decenoylcarnitine, respectively.
We selected carnitine biosynthesis and metabolism for this whole genome analysis because of its moderate size (i.e. 16 metabolites were represented in the serum metabolomics measurements) and its role in health and disease, primarily through fatty acid metabolism (Hoppel C. et al., 2003). In addition to overt carnitine deficiency, carnitine supplementation has been suggested as a therapeutic for multiple heart- and muscle-related disorders (Kelly G. S. et al., 1998). Biosynthesis of carnitine occurs primarily in the liver and kidney from the amino acids lysine and methionine (Bremer J., 1983). Lysine is an essential amino acid and is abundant in foods high in protein. In addition to lysine, another metabolite having significant effects from this whole genome analysis was cis-4-decenoylcarnitine. Unlike its cousin octanoylcarnitine, there is little literature on the function and metabolism of decenoylcarnitine. Decenoylcarnitine is elevated with carnitine treatment (Vernez et al., 2006). Bhuiyan et al. (1992) report that decenoylcarnitine can be found in the urine of patients suspected of having defects in mitochrondrial beta-oxidation.
The mechanisms of the association of FGF8 and MDGA2 with lysine and cis-4-decenoylcarnitine, respectively, is unknown. Although carnitine is synthesized from the essential amino acid lysine, we did not observe evidence that reduced lysine leads to reduced carnitine. In the data set analyzed here, the correlation between lysine and carnitine was only 0.02. In addition, individuals with rare variants in FGF8 associated with reduced lysine did not show deficiency in the other 15 metabolites selected from the pathway of carnitine synthesis and metabolism (data not shown). This most likely is due to only a small percentage of lysine going to carnitine biosynthesis. FGF8 is a member of the fibroblast growth factor family expressed at high levels in the testes and ovaries (www.genecards.org), and the protein is necessary for normal brain development and maintenance, specifically the borders among certain segments (Crossley and Martin, 1995). Carnitine metabolism is involved in both male fertility (Ahmed S. D. et al., 2011) and a variety of brain processes (Nałecz K. A. et al., 2004). Furthermore, relative lysine deficiency may be related to multiple cognitive and behavioral-related disorders by acting as precursor of glutamate in the central nervous system (Papes F. et al., 2001) and its relationship with cortisol (Smriga M. et al., 2007).
Using the CCRS test, MDGA2 was significantly associated with cis-4-decenoylcarnitine. In follow-up analyses, there were three rare synonymous variants that were individually significantly related to the levels of this metabolite. As can be seen in Table III, one of the variants that is in the beginning of the gene increased levels, while the other two toward the end of the gene decreased levels. MDGA2 is expressed in the central and peripheral nervous system in the rat (http://rgd.mcw.edu/). Previously, MDGA2 was known as MAMDC1. MDGA2 gene variation has previously been associated with multiple neurologic and behavioral conditions (https://www.ebi.ac.uk/gwas/). Similarly, carnitine and carnitine metabolites have also been related to multiple neurologic and behavioral conditions, especially autism (e.g. Filipek P. A. et al., 2004).
The results reported here document the feasibility of whole genome sequence analysis of endophenotypes close to the gene level. They also document the benefits of applying multiple methods of analysis, in this case a burden test, SKAT and CCRS, to promote discovery because each method has relative strengths and weaknesses. In the future, sample sizes for whole genome sequence analysis of metabolomics data will increase, partly fueled by declining costs. Likewise, projects such as ENCODE and GTEX will permit better annotation of non-genic coding regions. In addition to the domains of genetic epidemiology, adequate sample sets, quality phenotyping and whole genome sequencing, future studies will need to directly incorporate high through put functional assays as an integral part of discovery.
The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641, R01HL59367 and R01HL086694; National Human Genome Research Institute contract U01HG004402; and National Institutes of Health contract HHSN268200625226C. The metabolome measurement work obtained through support from the National Genome Research Institute (HG004402) and the generous support of the University of Texas Health Science Center at Houston. The DNA sequence data work obtained through support from the National Heart Lung and Blood Institute (HL102419) and National Human Genome Research Institute (HG003273 and HG006542) of the National Institute of Health.
The authors declare that they have no competing interests.