We conducted a GWAS of prospectively collected 25(OH)D in five studies with a total of 5,575 individuals (). The five GWAS were: a case-control study of breast cancer (BRCA) and a case-control study of type 2 diabetes (T2D), both nested within the Nurses’ Health Study (NHS)[Willett, et al. 1987
], a case-control study of myocardial infarction (MI) nested within the Health Professionals Follow-up Study (HPFS)[Giovannucci, et al. 2008
], and previously selected case-control sets from prior studies which included subjects from the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC)
and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO)[Hayes, et al. 2005
Characteristics of GWAS cohorts
NHS began in 1976 when 121,700 female US registered nurses 30 to 55 years of age were enrolled and are contacted every 2 years by questionnaire with response rates exceeding 90% per cycle.[Colditz, et al. 1986
]. Between 1989 and 1990, 32,826 women between the ages of 43 and 69 provided blood samples. Of those, we included participants with GWAS data and plasma 25(OH)D from nested case-control studies of breast cancer (n=870) [Willett, et al. 1987
] and T2D (n=721)[Qi, et al. 2010
]. Controls were matched to cases on age and month and year of blood draw.
HPFS began in 1986 with 51,529 US male health professionals, ages 40-75 years, providing baseline demographic and medical information, updated every two years [Koushik, et al. 2006
; Rimm, et al. 1992
]. Blood samples were collected from 18,225 of the HPFS participants from 1993 to1995. We included in our analysis 1245 HPFS participants with GWAS data and pre-disease diagnosis plasma 25(OH)D from a case-control study of MI[Giovannucci, et al. 2008
The ATBC study was a randomized, placebo-controlled cancer prevention intervention trial of supplementation with alpha-tocopherol, beta-carotene or both conducted in southwestern Finland from 1985 to 1993[Group 1994
]. Participants were all 50-69 year old male smokers. Fasting blood samples used for 25(OH)D measurement were ascertained at study entry andwhole blood collection (DNA source) was done between 1992-1993. GWAS and circulating 25(OH)D levels data were available for 1372 men[Ahn, et al. 2010
The PLCO study was a large, randomized multi-center trial designed to evaluate the effectiveness of cancer screening and examine early markers of cancer, with approximately 155,000 male and female U.S. residents aged 55 to 74 at baseline were recruited in 1993 to 2001 and all screening-arm subjects provided non-fasting blood samples at baseline [Hayes, et al. 2005
]. A total of 1316 Caucasian subjects with GWAS data and serum measurements of 25(OH)D from a nested case-control study of prostate cancer were included in these analyses[Ahn, et al. 2008
25(OH)D concentrations for BRCA plasma samples were measured by radioimmunoassay (RIA)[Hollis 1997
] in three batches, two in Dr. Michael Holick’s laboratory at Boston University School of Medicine (coefficients for variation (CV) 8.7 – 17.6%) and a third in Dr. Bruce Hollis’ laboratory at the Medical University of South Carolina in Charlestown, SC (CV 8.7%) [Ahn, et al. 2010
; Bertone-Johnson, et al. 2005
]. For the T2D study, plasma levels of 25(OH)D were measured in the Nutrition Evaluation Laboratory in the Human Nutrition Research Center on Aging at Tufts University (CV 8.7%) by rapid extraction followed by equilibrium I-125 RIA procedure as specified by the manufacturer’s procedural documentation and analyzed by gamma counter (Cobra II, Packard)[Ahn, et al. 2010
]. 25(OH)D concentrations for the MI plasma samples were measured by RIA in Dr. Hollis’ lab[Giovannucci 2005
; Hollis, et al. 1993
]. For ATBC and PLCO serum 25(OH)D levels were measured by competitive chemiluminescence immunoassay (CLIA) at Heartland Assays, Inc. by Dr. Ron Horst, with coefficients of variation (CVs) in the blinded duplicate quality-control samples of 9.3% (intrabatch) and 12.7% (interbatch) for ATBC [Ahn, et al. 2010
; Gallicchio, et al. 2010
] and 8.1% for PLCO.
GWAS genotyping was performed using the Illumina 550K (or higher density) platform in the BRCA, PLCO and ATBC studies and the Affymetrix 6.0 platform in the T2D and MI studies. Quality-control assessment of genotypes including sample completion rates, SNP call rates, concordance rates, deviation from Hardy-Weinberg proportions in control DNA and final sample selection for associational analysis are described in detail elsewhere [Ahn, et al. 2010
; Easton, et al. 2007
; Hunter, et al. 2007
; Qi, et al. 2010
; Thomas, et al. 2009
]. Genotypes for markers contained on the Illumina 550K platform and not the Affymetrix 6.0 platform were imputed using a Markov Chain based Haplotyper (MACH) and HapMap phase II CEU reference panel (Rel 22), with retention of only high quality SNPs with a MACH-r2 of 0.95 or greater in all data sets. After restricting to SNPs either directly genotyped or imputed to Illumina, we had a total of 359,928 SNPs eligible for inclusion in a polygenic score.
The percentage of variance explained by the 25(OH)D GWAS identified SNPs from the gene regions encoding GC, CYP2R1, CYP24A1 and DHCR7/NADSYN1 was estimated among subjects from the BRCA, T2D, HPFS, PLCO and ATBC studies. Residual sums of squares for the score itself (r2 = variance of the model/total variance) were estimated by linear regression for models containing the four SNPs alone; the SNPs along with 25(OH)D associated covariates (age, BMI, case-control status, region of residence, season of blood draw, quartiles of vitamin D supplement intake, quintiles of vitamin D intake from food sources and two eigenvectors from principal components); and covariates alone. These sums of squares were then used to estimate the proportion of the variance explained by the SNPs after removing variation explained by other covariates, in the form of the adjusted r2 (adjusted r2 = [SSresid(covar) - SSresid(covar+score)]/SSresid(covar))
To create the polygenic scores we first conducted GWAS of 25(OH)D in each of the five studies separately. GWAS adjusted for age, case-control status, sex, BMI (kg/m2
), season of blood collection, vitamin D dietary intake, vitamin D supplement intake, and eigenvectors to control for population stratification, with the BRCA, T2D, MI and PLCO cohorts additionally adjusting for US region of residence. These individual study results were then meta-analyzed using an inverse variance weighted approach, as implemented in the software METAL[Ahn, et al. 2010
; Willer, et al.].
Using the results of the meta-analysis, SNPs were ranked by decreasing significance (i.e., ascending p-value) for the association with 25(OH)D. An in-house Python script that references the HapMap Rel.23a (NCBI B36) database was used to remove those SNPs in linkage disequilibrium (LD >0.2). The number of SNPs meeting specified p-value cut offs (p ≤1.0 × 10−7
, ≤1.0 × 10−5
, ≤1.0 × 10−3
, ≤0.01, ≤0.05, ≤0.1, ≤0.15, ≤0.2, ≤0.5) was determined. PLINK software[Purcell, et al. 2007
] was then used to construct 9 polygenic scores based on the SNPs corresponding to each cut-off value. The polygenic score is an allelic scoring system involving specified SNPs and identified risk alleles (allele associated with increased circulating 25(OH)D), to assign a single quantitative index of genetic risk to each subject. A polygenic score assumes an additive and identical allelic effect for each variant (not dominant or recessive), and no gene-gene interactions.
Five-fold cross validation of the cohorts was completed testing the performance of 9 potential polygenic scores corresponding to specified p-value cut-offs. To conduct cross-validation, training sets were created by meta-analyzing 4 of the 5 cohorts using METAL software[Willer, et al.], leaving the remaining cohort as the testing set. The polygenic score was constructed using the method described above, and the association between the score and circulating 25(OH)D is estimated in the testing set. Of the 9 potential polygenic score models, the model with the highest associated r2 value averaged across the testing sets was taken to be the model explaining the most variance in circulating 25(OH)D.
We repeated the analyses applying the MACH r2 threshold for imputation of 0.3, resulting in a total of 506,671 SNPs eligible for inclusion in a polygenic score. From these SNPs we removed those SNPs in linkage disequilibrium (LD >0.8) and excluded those SNPs within 1 Mb of the four known genome-wide significant SNPs.
SAS/STAT® software (SAS Institute Inc., Cary, NC) was used to conduct a linear regression analysis in each of the 5 testing sets for each of the 9 polygenic score models, and estimate the proportion of the total variation in circulating 25(OH)D due to the score in the crude analysis. We also estimated the proportion of the variance explained by the score after removing variation explained by other covariates including age, BMI, case-control status, region of residence, season of blood draw, quartiles of vitamin D supplement intake, quintiles of vitamin D intake from food sources and two eigenvectors from principal components.
Individual cohorts were analyzed with software for linear-mixed-model analysis using the GCTA software tool (0.91.0). We used genotyped SNPs meeting quality control measures (BRCA: 529,423 SNPs; T2D: 678,082 SNPs; MI: 697,899 SNPs; ATBC: 530,324 SNPs; PLCO:514,636 SNPs) to create a genetic relationship matrix, which uses the restricted maximum likelihood (REML) method to fit a linear-mixed-model to estimate the variance in residual 25(OH)D explained by additive genetic matrix created from these SNPs. The covariates listed above are included as fixed effects and we meta-analyzed the results from the 5 cohorts employing an inverse-variance weighted approach.