MESA is a population-based study of 6814 men and women aged 45 to 84 years, free of known CVD at baseline, recruited between 2000 and 2002 from six US communities. The main objective of the study is to determine the characteristics of subclinical cardiovascular disease and its progression. Details of the objectives and design of MESA have been published.10
Institutional review board approval was obtained at all MESA sites and all participants gave their informed consent.
For initial genotyping analyses, a subcohort of 720 subjects was selected from the total MESA cohort of 6814. Eligible subjects had given informed consent for DNA extraction and the genetic sub-study and had samples with sufficient DNA in the study DNA laboratory. Priority was given to subjects who participated in the MESA 3 additional blood biomarker collection, supplemented by random selection from the remaining participant samples to achieve balanced representation of ethnic groups and sexes. All DNA was of high quality as measured by the OD260/OD280 ratio (mean 1.77) and of high molecular weight as determined by gel electrophoresis. For the current study, to minimize the possibility of spurious genetic associations due to population stratification, we analyzed only the data from MESA participants who self-identified as having European ancestry.
Coronary artery calcification (CAC) was determined with electron beam or helical CT11. The average Agatston score of two scans was used, and the presence of CAC was defined as an Agatston score > 0.
DNA was extracted from peripheral leukocytes isolated from packed cells of anticoagulated blood by use of a commercially available DNA isolation kit (Puregene; Gentra Systems, Minneapolis, MN). The DNA was quantified by determination of absorbance at 260 nm followed by PicoGreen analysis (Molecular Probes, Inc., Eugene, OR). Two vials of DNA were stored per participant at −70 degrees centigrade and subsequently aliquoted for use.
MESA investigators proposed candidate genes for two separate gene marker panels (MESA Candidate Gene Panel 1 and 2), and the genes were priority ranked by contributing investigators. The list of genes included is shown in Supplementary Table 1. For Panel 2, additional weight was given to genes proposed by the MESA Eye ancillary study, a study of retinal microvascular characteristics as predictors of subclinical and clinical cardiovascular diseases. Final priority for both panels was assigned by the MESA Family Study Genetics Committee.
SNPs for the chosen genes were selected according to the following criteria. First, SNPs within the proximal and distal 10-kilobase regions 5′ and 3′ of the given candidate gene (NCBI Build 35) were chosen. Next, SNP compatibility with the Illumina GoldenGate technology12,13 as determined by the Assay Design Tool (TechSupport, Illumina, San Diego, CA) was required. Finally, SNPs were selected if they had a minor allele frequency (MAF) greater than or equal to 0.05 or were a tag (r² of at least 0.8) for another SNP with MAF > 0.05, as determined by applying the multi-locus or “aggressive” “Tagger” option of Haploview (v3)14,15 using International HapMap project data for CEPH and Yoruban populations (release 19)16. Because of these competing criteria, a complete set of tagSNPs could not be found for some genes, and additional SNPs were selected from one of the following three sources: 1) LDselect analysis of resequencing information from the Seattle SNPs project, if available17,18; 2) non-synonymous SNPs from dbSNP (release 124)19; 3) SNPs with a prior report of association with a phenotype similar or identical to one measured in MESA and proposed by a MESA investigator.
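The tagging criterion (r² of at least 0.8) can be illustrated with a small calculation. The sketch below computes pairwise r² between two biallelic SNPs from phased haplotype counts and applies the 0.8 threshold. The function names and the count-based input are hypothetical choices for illustration. It covers only the pairwise case and does not reproduce Haploview's multi-locus Tagger.

```python
def ld_r_squared(n_ab, n_aB, n_Ab, n_AB):
    """Pairwise r^2 between two biallelic SNPs from phased haplotype counts.

    n_AB counts haplotypes carrying allele A at SNP 1 and allele B at SNP 2;
    the other three arguments count the remaining allele combinations.
    """
    total = n_ab + n_aB + n_Ab + n_AB
    if total == 0:
        raise ValueError("at least one haplotype is required")
    p_A = (n_Ab + n_AB) / total      # frequency of allele A at SNP 1
    p_B = (n_aB + n_AB) / total      # frequency of allele B at SNP 2
    p_AB = n_AB / total              # frequency of the A-B haplotype
    d = p_AB - p_A * p_B             # linkage disequilibrium coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return 0.0                   # a monomorphic SNP cannot tag anything
    return d * d / denom


def is_tag(n_ab, n_aB, n_Ab, n_AB, threshold=0.8):
    """True if the two SNPs are in LD at or above the tagging threshold."""
    return ld_r_squared(n_ab, n_aB, n_Ab, n_AB) >= threshold
```

For example, `is_tag(60, 0, 0, 40)` describes two SNPs whose alleles always occur together, so either SNP could stand in for the other on the panel.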
In MESA CG Panel 1, ancestry informative markers (AIMs) were selected from an Illumina proprietary SNP database to maximize the difference in allele frequencies between any pair of ethnic groups: Caucasian- vs African-American; Caucasian- vs Chinese-American; African- vs Chinese-American. For MESA CG Panel 2, additional markers informative for Mexican-American ancestry were selected from published lists20,21.
Genotyping was performed by Illumina Genotyping Services (Illumina Inc., San Diego, CA) using their proprietary GoldenGate assay. The SNPs were typed in two separate panels of 1536 markers, selected to assay multiple phenotype x gene hypotheses. Illumina performed initial quality control in their laboratory to identify samples and SNPs that failed genotyping according to proprietary protocols, and sporadic failed genotypes with gencall quality score <0.25. Of 156 duplicate pairs included in 33 plates of samples typed, Illumina were blinded to 92 pairs. Both unblinded and blinded sample replicate concordance rates were > 99.99%. After removal of failed SNPs and samples, the genotype calling rate was 99.93%, with maximum missing data rate per sample of 2.1%, and maximum missing data per SNP of 4.98%. The cohort genetic data was checked for cryptic sample duplicates and discrepancies in genetically predicted sex (using X markers) versus study database reported sex. Samples with unresolved duplicate and sex discrepancies were removed from the genetic study database.
These criteria resulted in 712 individuals of European descent genotyped at 2882 SNPs. Of these individuals, 393 (55.2%) had previously been determined to have detectable coronary artery calcification at the baseline exam. Missing SNP values in the genotype data were imputed by random assignment according to the marginal frequencies of each SNP across the cohort, in an effort to favor the null hypothesis of no association as much as possible. A total of 593 SNPs were removed for having minor allele frequencies lower than 5%. A further 111 autosomal SNPs were rejected because the distribution of their alleles violated Hardy-Weinberg equilibrium (p<0.05)22. These tests resulted in a total of 2177 SNPs being eligible for our model search algorithms.
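The per-SNP processing described above (marginal-frequency imputation, the 5% MAF filter, and the Hardy-Weinberg filter) can be sketched as follows. This is a minimal illustration. Genotypes are assumed to be coded as 0, 1, or 2 copies of one allele with `None` for missing calls, and a one-degree-of-freedom chi-square test stands in for the Hardy-Weinberg test, whose exact form is not stated in the text. All function names are hypothetical.

```python
import math
import random


def impute_missing(genotypes, rng):
    """Fill None entries by sampling from this SNP's observed genotypes.

    Drawing uniformly from the observed calls is the same as drawing
    according to the SNP's marginal genotype frequencies.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("SNP has no observed genotypes")
    return [g if g is not None else rng.choice(observed) for g in genotypes]


def minor_allele_frequency(genotypes):
    """MAF for genotypes coded as 0/1/2 copies of one allele."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(genotypes, maf_min=0.05, hwe_alpha=0.05):
    """Keep a SNP with MAF >= maf_min and no HWE violation at hwe_alpha.

    The text applies the HWE filter to autosomal SNPs only; the caller is
    assumed to skip that check for X-linked markers.
    """
    counts = [genotypes.count(g) for g in (0, 1, 2)]
    return (minor_allele_frequency(genotypes) >= maf_min
            and hwe_chi2_pvalue(*counts) >= hwe_alpha)
```

A call such as `passes_qc(impute_missing(calls, random.Random(0)))` would apply the imputation and both filters to one SNP, where `calls` is that SNP's genotype list.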
Additional non-genetic data were available for each individual, and we selected the following for inclusion in our model-building process, on the basis of suspected associations with atherosclerosis or CVD: Age, Sex, Education Level, Income, Smoking Status, Weight, Body Mass Index (BMI), Diabetes, LDL Cholesterol, HDL Cholesterol, Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, Hypertension, Walking Speed, and Minutes of Exercise per Week. Income and Education are included as potential proxies for other unmeasured environmental factors. In all categories, responses were binned into at most five discrete values using a simple linear binning strategy (different binning strategies did not affect results).
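One reading of the "simple linear binning strategy" is equal-width binning over the observed range of each variable. The sketch below takes that reading as an assumption. The function name and the handling of constant variables are hypothetical.

```python
def linear_bins(values, n_bins=5):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)     # a constant variable falls in one bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For instance, systolic blood pressure readings spanning 90 to 190 mmHg would be split into five bins of 20 mmHg each.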
We then sorted all remaining variables by their Bayes factor23, a statistic that assesses the likelihood increase associated with conditioning the outcome (presence of coronary calcification) upon the variable in question. A Bayes factor > 1 (or equivalently, a log Bayes factor > 0) for a particular variable indicates that the variable is more likely to be probabilistically associated with coronary calcification than to be probabilistically independent of it. Bayes factors were computed using conjugate Dirichlet priors as described in Cooper and Herskovits24. We used the Bayes factors to reduce the number of SNPs to a tractable amount: only 50 SNPs had log Bayes factor > 0, and the 17 SNPs with the highest Bayes factors were all on the X chromosome. We also note that ten of the clinical variables had log Bayes factor > 0 (age, sex, diabetes, systolic blood pressure, hypertension, HDL cholesterol, smoking, BMI, total cholesterol, and education). All SNPs and clinical variables with log Bayes factor > 0 are listed in Supplementary Table 2.
We used Bayesware Discoverer (www.bayesware.com) to learn a Bayesian network structure on these variables. A Bayesian network is a statistical model with properties that ideally suit it to the task of discovering predictive models of complex diseases25,26. Among these is the ability of the model marginal likelihood to penalize model complexity without need of further adjustment.27 Indeed, Bayesian networks have proved successful in previous prediction problems of complex diseases, including stroke28. We built three separate Bayesian network models: one using the 50 positive-Bayes factor SNPs alone; another using all 16 clinical variables alone; and a third using the combination of the 50 SNPs and 16 clinical variables.
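The Bayes factor screen can be made concrete with a short calculation. The sketch below compares the marginal likelihood of the outcome conditioned on a discrete variable with the marginal likelihood of the outcome alone, using symmetric Dirichlet priors with closed-form integrals. A uniform prior (`alpha=1.0`, the K2 form of the Cooper and Herskovits metric) is an assumption, as are the function names. The priors used inside Bayesware Discoverer may differ.

```python
from collections import Counter
from math import lgamma


def log_marginal_likelihood(counts, alpha=1.0):
    """Log marginal likelihood of multinomial counts under Dirichlet(alpha)."""
    r = len(counts)
    n = sum(counts)
    total = lgamma(r * alpha) - lgamma(r * alpha + n)
    for c in counts:
        total += lgamma(alpha + c) - lgamma(alpha)
    return total


def log_bayes_factor(variable, outcome, alpha=1.0):
    """Log Bayes factor for 'outcome depends on variable' vs independence.

    variable and outcome are equal-length sequences of discrete states.
    The marginal likelihood of the variable itself is identical under both
    models and cancels, so only the outcome terms are computed.
    """
    states = sorted(set(outcome))
    marginal = Counter(outcome)
    log_indep = log_marginal_likelihood([marginal[s] for s in states], alpha)

    joint = Counter(zip(variable, outcome))
    log_dep = 0.0
    for v in set(variable):
        # One Dirichlet-multinomial term per state of the conditioning variable.
        log_dep += log_marginal_likelihood([joint[(v, s)] for s in states], alpha)
    return log_dep - log_indep
```

A variable would pass the screen described above when `log_bayes_factor(variable, cac_present) > 0`, where `cac_present` is a hypothetical list of 0/1 calcification indicators.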
We compared performance of the three different models using the area under the convex hull of the receiver operating characteristic curve (AUC), which measures performance of a classifier at different false positive/false negative thresholds. We then followed the prescription of Lasko et al.29 to compute a p-value comparing the difference of two AUCs, computing the variance of an AUC according to the nonparametric method described by DeLong et al.30
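The AUC comparison can be sketched in a few lines. The example below estimates each AUC empirically (the Mann-Whitney statistic), derives the variance of the AUC difference from the DeLong structural components, and converts the resulting z statistic into a two-sided p-value. It uses the empirical AUC as a simplification and does not take the convex hull of the ROC curve. Both classifiers are assumed to be scored on the same subjects, and all function names are hypothetical.

```python
import math


def _components(pos, neg):
    """DeLong structural components for one classifier's scores."""
    def psi(x, y):
        return 1.0 if x > y else (0.5 if x == y else 0.0)
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Return (auc_a, auc_b, two-sided p) for two classifiers, same subjects.

    labels are 1 for cases and 0 for controls; at least two of each are needed.
    """
    def split(scores):
        pos = [s for s, l in zip(scores, labels) if l == 1]
        neg = [s for s, l in zip(scores, labels) if l == 0]
        return pos, neg

    v10_a, v01_a = _components(*split(scores_a))
    v10_b, v01_b = _components(*split(scores_b))
    m, n = len(v10_a), len(v01_a)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m

    # Variance of (auc_a - auc_b), including the covariance between classifiers.
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n)
    if var <= 0:
        # Degenerate case: no sampling variability in the difference.
        return auc_a, auc_b, 1.0 if auc_a == auc_b else 0.0
    z = (auc_a - auc_b) / math.sqrt(var)
    return auc_a, auc_b, math.erfc(abs(z) / math.sqrt(2))
```

Here `scores_a` and `scores_b` would be the predicted probabilities of calcification from two of the three models, evaluated on the same individuals.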