|Home | About | Journals | Submit | Contact Us | Français|
The primary circulating form of vitamin D is 25-hydroxy-vitamin D (25(OH)D), a modifiable trait linked with a growing number of chronic diseases. In addition to environmental determinants of 25(OH)D, including dietary sources and skin ultraviolet B (UVB) exposure, twin and family-based studies suggest that genetics contribute substantially to vitamin D variability with heritability estimates ranging from 43% to 80%. Genome-wide association studies (GWAS) have identified SNPs located in four gene regions associated with 25(OH)D. These SNPs collectively explain only a fraction of the heritability in 25(OH)D estimated by twin and family based studies. Using 25(OH)D concentrations and GWAS data on 5,575 subjects drawn from 5 cohorts, we hypothesized that genome-wide data, in the form of (1) a polygenic score comprised of hundreds or thousands of SNPs that do not individually reach GWAS significance, or (2) a linear-mixed-model for genome-wide complex trait analysis, would explain variance in measured circulating 25(OH)D beyond that explained by known genome-wide significant 25(OH)D associated SNPs. GWAS identified SNPs explained 5.2% of the variation in circulating 25(OH)D in these samples and there was little evidence additional markers significantly improved predictive ability. On average a polygenic score comprised of GWAS identified SNPs explained a larger proportion of variation in circulating 25(OH)D than scores comprised of thousands of SNPs which were on average, non-significant. Employing a linear-mixed-model for genome-wide complex trait analysis explained little additional variability (range 0-22%). The absence of a significant polygenic effect in this relatively large sample suggests an oligogenetic architecture for 25(OH)D.
Vitamin D is a hormone essential for normal growth and development. The primary circulating form, 25-hydroxy-vitamin D (25(OH)D), is a modifiable trait with well documented effects in bone, muscle and mineral homeostasis, that has also been linked with a growing number of diseases ranging from malignancy, most notably colorectal cancer to autoimmune diseases including multiple sclerosis[Holick 2004; Munger, et al. 2006]. While vitamin D can be derived from dietary sources, it is mainly produced by skin ultraviolet B (UVB) exposure. To obtain the biologically active metabolite 1, 25-dihydroxyvitamin D3 (1,25(OH)2D), vitamin D undergoes a series of hydroxylation steps primarily in the liver and kidney. 1,25(OH)2D, is the most biologically active metabolite of vitamin D but its levels are under tight homeostatic regulation and thus 25(OH)D is considered the best measure of vitamin D status[Holick 2004].
In addition to environmental determinants of circulating vitamin D levels, twin- and family-based studies suggest that genetics contribute substantially to vitamin D variability with heritability estimates ranging from 43% to 80%[Hunter, et al. 2001; Orton, et al. 2008; Shea, et al. 2009; Wjst, et al. 2007]. These estimates are often derived from comparisons of the higher intraclass correlation (ICC) observed in monozygotic (MZ) twins to that of dizygotic twins (DZ), with this difference providing the first impression of the magnitude of genetic influence[Hunter, et al. 2001; Orton, et al. 2008; Shea, et al. 2009].
There have been two published genome-wide association studies (GWAS) of circulating 25(OH)D concentrations[Ahn, et al. 2010; Wang, et al. 2010]. These studies identified associated variants in the group-specific component gene (GC) that encodes the vitamin D binding protein; in the gene encoding CYP2R1 (cytochrome P450, family 2, subfamily R, polypeptide 1) involved in hydroxylation of vitamin D3 to 25(OH)D; in CYP24A1 which encodes 24-hydroxylase involved in the degradation of 1,25(OH)2D; in the gene encoding DHCR7 (7-dehydrocholesterol (7-DHC) reductase), which converts 7-DHC to cholesterol, thereby removing the substrate from the synthetic pathway of vitamin D3, a precursor of 25-hydroxyvitamin D3, and NADSYN1 (nicotinamide adenine dinucleotide (NAD) synthetase)[Ahn, et al. 2010; Wang, et al. 2010]. However these SNPs collectively explain only a fraction of the heritability in 25(OH)D estimated by twin- and family-based studies. As we describe further below, we estimate that the percentage of residual variance in circulating 25(OH)D explained by SNPs in these gene regions is approximately 5%, comparable to prior studies[Ahn, et al. 2010; Signorello, et al. 2011; Sinotte, et al. 2009; Wang, et al. 2010].
A major limitation of prior GWAS of vitamin D is that the effect sizes of individual alleles are often small, such that the predictive value of a single variant of small effect on circulating 25(OH)D levels is negligible[Evans, et al. 2009]. In an effort to include loci with small effects that may not reach genome-wide significance in GWAS, genome-wide scores calculated by including such SNPs have demonstrated improved discriminative accuracy for complex disease affection status in bipolar disorder, schizophrenia, coronary heart disease, type I and type II diabetes and multiple sclerosis [Bush, et al.; Evans, et al. 2009; Purcell, et al. 2009]. The challenge in employing this method is choosing the optimal threshold for deciding which loci to include in the calculation of the genome-wide score. Too liberal a threshold is likely to incorporate noise and hence obscure any true signal, whereas an overly strict threshold may cause one to disregard loci that genuinely contribute to disease risk. A solution to overcome this problem is to fit all the SNPs from the GWAS simultaneously. The software tool, genome-wide complex trait analysis (GCTA) employs linear-mixed-models to fit the effects of all SNPs as random effects[Yang, et al. 2011]. This approach provides an estimate of the percentage variance explained by all the SNPs genotyped and tagged by the genome-wide platform[Yang, et al. 2010].
We hypothesized that by using genome-wide data in the form of (1) a polygenic score comprised of hundreds or thousands of SNPs that do not individually reach GWAS significance, and (2) a linear-mixed-model for genome-wide complex trait analysis, we could predict variance in measured circulating 25(OH)D beyond that explained by known genome-wide significant 25(OH)D associated SNPs.
We conducted a GWAS of prospectively collected 25(OH)D in five studies with a total of 5,575 individuals (Table 1). The five GWAS were: a case-control study of breast cancer (BRCA) and a case-control study of type 2 diabetes (T2D), both nested within the Nurses’ Health Study (NHS)[Willett, et al. 1987], a case-control study of myocardial infarction (MI) nested within the Health Professionals Follow-up Study (HPFS)[Giovannucci, et al. 2008], and previously selected case-control sets from prior studies which included subjects from the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC) and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO)[Hayes, et al. 2005].
NHS began in 1976 when 121,700 female US registered nurses 30 to 55 years of age were enrolled and are contacted every 2 years by questionnaire with response rates exceeding 90% per cycle.[Colditz, et al. 1986]. Between 1989 and 1990, 32,826 women between the ages of 43 and 69 provided blood samples. Of those, we included participants with GWAS data and plasma 25(OH)D from nested case-control studies of breast cancer (n=870) [Willett, et al. 1987] and T2D (n=721)[Qi, et al. 2010]. Controls were matched to cases on age and month and year of blood draw.
HPFS began in 1986 with 51,529 US male health professionals, ages 40-75 years, providing baseline demographic and medical information, updated every two years [Koushik, et al. 2006; Rimm, et al. 1992]. Blood samples were collected from 18,225 of the HPFS participants from 1993 to1995. We included in our analysis 1245 HPFS participants with GWAS data and pre-disease diagnosis plasma 25(OH)D from a case-control study of MI[Giovannucci, et al. 2008].
The ATBC study was a randomized, placebo-controlled cancer prevention intervention trial of supplementation with alpha-tocopherol, beta-carotene or both conducted in southwestern Finland from 1985 to 1993[Group 1994]. Participants were all 50-69 year old male smokers. Fasting blood samples used for 25(OH)D measurement were ascertained at study entry andwhole blood collection (DNA source) was done between 1992-1993. GWAS and circulating 25(OH)D levels data were available for 1372 men[Ahn, et al. 2010].
The PLCO study was a large, randomized multi-center trial designed to evaluate the effectiveness of cancer screening and examine early markers of cancer, with approximately 155,000 male and female U.S. residents aged 55 to 74 at baseline were recruited in 1993 to 2001 and all screening-arm subjects provided non-fasting blood samples at baseline [Hayes, et al. 2005]. A total of 1316 Caucasian subjects with GWAS data and serum measurements of 25(OH)D from a nested case-control study of prostate cancer were included in these analyses[Ahn, et al. 2008].
25(OH)D concentrations for BRCA plasma samples were measured by radioimmunoassay (RIA)[Hollis 1997] in three batches, two in Dr. Michael Holick’s laboratory at Boston University School of Medicine (coefficients for variation (CV) 8.7 – 17.6%) and a third in Dr. Bruce Hollis’ laboratory at the Medical University of South Carolina in Charlestown, SC (CV 8.7%) [Ahn, et al. 2010; Bertone-Johnson, et al. 2005]. For the T2D study, plasma levels of 25(OH)D were measured in the Nutrition Evaluation Laboratory in the Human Nutrition Research Center on Aging at Tufts University (CV 8.7%) by rapid extraction followed by equilibrium I-125 RIA procedure as specified by the manufacturer’s procedural documentation and analyzed by gamma counter (Cobra II, Packard)[Ahn, et al. 2010]. 25(OH)D concentrations for the MI plasma samples were measured by RIA in Dr. Hollis’ lab[Giovannucci 2005; Hollis, et al. 1993]. For ATBC and PLCO serum 25(OH)D levels were measured by competitive chemiluminescence immunoassay (CLIA) at Heartland Assays, Inc. by Dr. Ron Horst, with coefficients of variation (CVs) in the blinded duplicate quality-control samples of 9.3% (intrabatch) and 12.7% (interbatch) for ATBC [Ahn, et al. 2010; Gallicchio, et al. 2010] and 8.1% for PLCO.
GWAS genotyping was performed using the Illumina 550K (or higher density) platform in the BRCA, PLCO and ATBC studies and the Affymetrix 6.0 platform in the T2D and MI studies. Quality-control assessment of genotypes including sample completion rates, SNP call rates, concordance rates, deviation from Hardy-Weinberg proportions in control DNA and final sample selection for associational analysis are described in detail elsewhere [Ahn, et al. 2010; Easton, et al. 2007; Hunter, et al. 2007; Qi, et al. 2010; Thomas, et al. 2009]. Genotypes for markers contained on the Illumina 550K platform and not the Affymetrix 6.0 platform were imputed using a Markov Chain based Haplotyper (MACH) and HapMap phase II CEU reference panel (Rel 22), with retention of only high quality SNPs with a MACH-r2 of 0.95 or greater in all data sets. After restricting to SNPs either directly genotyped or imputed to Illumina, we had a total of 359,928 SNPs eligible for inclusion in a polygenic score.
The percentage of variance explained by the 25(OH)D GWAS identified SNPs from the gene regions encoding GC, CYP2R1, CYP24A1 and DHCR7/NADSYN1 was estimated among subjects from the BRCA, T2D, HPFS, PLCO and ATBC studies. Residual sums of squares for the score itself (r2 = variance of the model/total variance) were estimated by linear regression for models containing the four SNPs alone; the SNPs along with 25(OH)D associated covariates (age, BMI, case-control status, region of residence, season of blood draw, quartiles of vitamin D supplement intake, quintiles of vitamin D intake from food sources and two eigenvectors from principal components); and covariates alone. These sums of squares were then used to estimate the proportion of the variance explained by the SNPs after removing variation explained by other covariates, in the form of the adjusted r2 (adjusted r2 = [SSresid(covar) - SSresid(covar+score)]/SSresid(covar))
To create the polygenic scores we first conducted GWAS of 25(OH)D in each of the five studies separately. GWAS adjusted for age, case-control status, sex, BMI (kg/m2), season of blood collection, vitamin D dietary intake, vitamin D supplement intake, and eigenvectors to control for population stratification, with the BRCA, T2D, MI and PLCO cohorts additionally adjusting for US region of residence. These individual study results were then meta-analyzed using an inverse variance weighted approach, as implemented in the software METAL[Ahn, et al. 2010; Willer, et al.].
Using the results of the meta-analysis, SNPs were ranked by decreasing significance (i.e., ascending p-value) for the association with 25(OH)D. An in-house Python script that references the HapMap Rel.23a (NCBI B36) database was used to remove those SNPs in linkage disequilibrium (LD >0.2). The number of SNPs meeting specified p-value cut offs (p ≤1.0 × 10−7, ≤1.0 × 10−5, ≤1.0 × 10−3, ≤0.01, ≤0.05, ≤0.1, ≤0.15, ≤0.2, ≤0.5) was determined. PLINK software[Purcell, et al. 2007] was then used to construct 9 polygenic scores based on the SNPs corresponding to each cut-off value. The polygenic score is an allelic scoring system involving specified SNPs and identified risk alleles (allele associated with increased circulating 25(OH)D), to assign a single quantitative index of genetic risk to each subject. A polygenic score assumes an additive and identical allelic effect for each variant (not dominant or recessive), and no gene-gene interactions.
Five-fold cross validation of the cohorts was completed testing the performance of 9 potential polygenic scores corresponding to specified p-value cut-offs. To conduct cross-validation, training sets were created by meta-analyzing 4 of the 5 cohorts using METAL software[Willer, et al.], leaving the remaining cohort as the testing set. The polygenic score was constructed using the method described above, and the association between the score and circulating 25(OH)D is estimated in the testing set. Of the 9 potential polygenic score models, the model with the highest associated r2 value averaged across the testing sets was taken to be the model explaining the most variance in circulating 25(OH)D.
We repeated the analyses applying the MACH r2 threshold for imputation of 0.3, resulting in a total of 506,671 SNPs eligible for inclusion in a polygenic score. From these SNPs we removed those SNPs in linkage disequilibrium (LD >0.8) and excluded those SNPs within 1 Mb of the four known genome-wide significant SNPs.
SAS/STAT® software (SAS Institute Inc., Cary, NC) was used to conduct a linear regression analysis in each of the 5 testing sets for each of the 9 polygenic score models, and estimate the proportion of the total variation in circulating 25(OH)D due to the score in the crude analysis. We also estimated the proportion of the variance explained by the score after removing variation explained by other covariates including age, BMI, case-control status, region of residence, season of blood draw, quartiles of vitamin D supplement intake, quintiles of vitamin D intake from food sources and two eigenvectors from principal components.
Individual cohorts were analyzed with software for linear-mixed-model analysis using the GCTA software tool (0.91.0). We used genotyped SNPs meeting quality control measures (BRCA: 529,423 SNPs; T2D: 678,082 SNPs; MI: 697,899 SNPs; ATBC: 530,324 SNPs; PLCO:514,636 SNPs) to create a genetic relationship matrix, which uses the restricted maximum likelihood (REML) method to fit a linear-mixed-model to estimate the variance in residual 25(OH)D explained by additive genetic matrix created from these SNPs. The covariates listed above are included as fixed effects and we meta-analyzed the results from the 5 cohorts employing an inverse-variance weighted approach.
Table 1 summarizes the characteristics of the cohorts included in the analyses. The percentage of variance explained by previously published 25(OH)D GWAS identified SNPs from the gene regions encoding GC (rs2282679), CYP2R1 (rs2060793), CYP24A1 (rs6013897) and DHCR7/NADSYN1 (rs3829251) was 4.9% in a crude model and 5.2% after conditioning on other known 25(OH)D associated covariates. The known 25(OH)D associated covariates themselves explained approximately 18% of the variance observed in circulating 25(OH)D (Table 2).
The polygenic analysis assumes additive and equal contribution of each locus to the polygenic score. Crude models containing the polygenic score alone demonstrated that polygenic scores comprised of genome-wide significant SNPs (p-value < 10−7 in the training set) explained the largest variance in circulating 25(OH)D across all testing sets. Averaging across all testing sets, the models comprised of 1 or 2 SNPs with p-value < 10−7, explained an average of 2.4% of the variance in 25(OH)D (Table 3). Previously published GWAS identified 25(OH)D associated SNPs were contained in each of these scores, with rs2282679 in the GC region contained in every score, and rs10500804 in CYP2R1 contained in the score generated for the ATBC testing set.
The known 25(OH)D associated covariates in the five studies, explain approximately 16.0% of the variance observed in circulating 25(OH)D. After conditioning on the known 25(OH)D associated covariates, the average of the residual variance explained in circulating 25(OH)D across all testing sets showed that a score comprised of 1-2 SNPs explained only an additional 2.6% of the variance in circulating 25(OH)D (Table 3).
When we applied an r2 threshold for imputation of 0.3, removed those SNPs with an LD>0.8 and those SNPs within 1 Mb of the known genome-wide significant SNPs, we lost all SNPs with a p- value ≤10−7. In this case, the best performing score across all testing sets was that comprised of SNPs with a p-value ≤10−2 that explained approximately 0.27% of the variance observed in circulating 25(OH)D, and 0.45% after conditioning on the genome-wide significant SNPs and 25(OH)D associated covariates (Table 4).
By employing a linear-mixed-model to estimate the variance in residual 25(OH)D explained by an additive genetic matrix created from all the SNPs genotyped and tagged by the genome-wide platform, we observed that the proportion of the variance explained in circulating 25(OH)D ranged from 8.8% (95% CI: 0 – 54.4%) in ATBC to 51.8% (95% CI: 0 – 100%) in BRCA. Meta-analysis of the five cohorts resulted in an genetic effect estimate of the proportion of additional variance in 25(OH)D of 8.9% (95% CI: 0 - 21.7%) across the cohorts after conditioning on known 25(OH)D-associated covariates.
In these analyses of circulating 25(OH)D we observed that a polygenic score comprised of the fewest number of SNPs explained a larger proportion of variance in circulating 25(OH)D, compared with polygenic scores comprised of thousands of SNPs. The proportion of variance in circulating 25(OH)D explained by the polygenic score comprised of the fewest number of genome-wide significant SNPs was comparable to that explained by GWAS identified SNPs, at approximately 5%. By employing a linear-mixed-model utilizing all the SNPs genotyped and tagged by the genome wide platform simultaneously, there was no statistically significant improvement in the proportion of variation in 25(OH)D explained by utilizing whole genome SNPs simultaneously.
Circulating 25(OH)D is a complex trait for which family studies have estimated heritability ranging from 43% to 80%[Hunter, et al. 2001; Orton, et al. 2008; Shea, et al. 2009; Wjst, et al. 2007]. Further elucidation of the genetic architecture of this complex trait beyond environmental determinants of 25(OH)D, has the potential for identifying those at risk of vitamin D insufficiency. It may also provide a useful proxy for lifetime vitamin D exposure that may be applied in instrumental variable analyses investigating the association of vitamin D and common, complex diseases. However, as we demonstrated, known GWAS associated SNPs explain only a fraction of the observed variance in circulating 25(OH)D at about 5%.
Prior studies have observed polygenic architecture in a number of complex phenotypes such as height, bipolar disorder, schizophrenia, coronary heart disease, type I and type II diabetes, multiple sclerosis and rheumatoid arthritis [Bush, et al.; Evans, et al. 2009; Lango Allen, et al. 2010; Purcell, et al. 2009; Stahl, et al. 2012], but many of these estimates based on GWAS data still fall short of the estimates of heritability derived from family based studies. We did not find that a polygenic score comprised of hundreds or thousands of SNPs explained variability in circulating 25(OH)D over and above that explained by oligogenetic models comprised of SNPs previously identified in GWAS. Even when applying more liberal thresholds for imputation and removal of SNPs in LD, once the genome-wide significant SNP regions were excluded, the remaining SNPs did not improve our estimates of the variance in circulating 25(OH)D. Our results are consistent with previous studies of breast, prostate and pancreatic cancer which found that polygenic risk scores explained only a small fraction of variation in disease risk. [Machiela, et al. 2011; Pierce, et al. 2012; Witte and Hoffmann 2011].
The linear-mixed-model estimates of the percent variance in 25(OH)D explained by GWAS-tagged SNPs were highly variable due to the relatively small sample size[Zaitlen and Kraft 2012]. The meta-analysis of these estimates from our cohorts (8.9%) was consistent with the estimate of percent variance explained by GWAS-significant SNPs alone (4.9%); the 95% confidence interval for the percent variance explained by tag-SNPs measured in GWAS excluded effects larger than 25%.
Potential limitations of this study include the fact that we have restricted our potential SNPs for inclusion to directly genotyped or imputed Illumina SNPs. The polygenic models assume an additive allelic effect and does not account for potential gene-gene interactions. The study was also limited by small sample size. Despite employing techniques to utilize SNPs that do not individually reach genome-wide significance, sample sizes on the order of 10,000s of subjects may be required to obtain estimates with reasonable confidence[Daetwyler, et al. 2008; Lango Allen, et al. 2010; Yang, et al. 2010]
This is the first study of 25(OH)D that employed a variety of methods utilizing common SNP data to refine the estimate of variability explained by genetics. Although known GWAS associated SNPs account for the largest proportion of variance in circulating 25(OH)D, it is still only a small fraction of the genetic variance expected based on family studies. This may indicate that the majority of the genetic effect for circulating 25(OH)D is due to rare variants, structural variants other than SNPs, epistasis, or gene-environment interaction [Maher 2008; Manolio, et al. 2009]. It may also reflect biases in the estimates of heritability[Zuk, et al. 2012]. Acknowledging the limitations of utilizing common SNP information, our findings still provide a useful proxy for circulating 25(OH)D, unconfounded by environmental determinants of 25(OH)D, for future studies of complex diseases.
We thank P. Soule for assistance, and we thank the participants in the Nurses’ Health Studies (NHS) and Health Professionals Follow-up Study (HPFS).
This work was supported by the Canadian Institute of Health Research [Health Professionals Fellowship Award to L.H.].
The NHS Cancer Genetic Markers of Susceptibility breast cancer GWAS was supported by N01-CO-12400. The NHS/HPFS Type 2 Diabetes GWAS (U01HG004399) is a component of a collaborative project that includes 13 other GWAS funded as part of the Gene-Environment Association Studies (GENEVA) under the National Institutes of Health (NIH) Genes, Environment, and Health Initiative (GEI). Genotyping was performed at the Broad Institute of the Massachusetts Institute of Technology and Harvard, with funding support from the NIH GEI (U01HG04424). The NHS/HPFS CHD GWAS was supported by HL35464 and CA55075 from the NIH with additional support for genotyping from Merck/Rosetta Research Laboratories, North Wales, PA. The NHS is supported by NIH grants P01CA087969 and 5U01HG004399-2, and the HPFS is supported by NIH grant P01CA055075.
The ATBC work was supported by the Intramural Research Program of the National Cancer Institute at the National Institutes of Health. Additionally, this research was supported by U.S. Public Health Service contracts N01-CN-45165, N01-RC-45035, N01-RC-37004, and HHSN261201000006C from the National Cancer Institute, Department of Health and Human Services.
Conflict of Interest Statement The authors declare no conflict of interest.