Many different genetic and clinical factors have been identified as causes or contributors to atherosclerosis. We present a model of preclinical atherosclerosis based on genetic and clinical data that predicts the presence of coronary artery calcification in healthy Americans of European descent aged 45 to 84 in the Multi-Ethnic Study of Atherosclerosis (MESA).
We assessed 712 individuals for the presence or absence of coronary artery calcification, and their genotypes for 2882 single-nucleotide polymorphisms (SNPs). Using these SNPs and relevant clinical data, a Bayesian network that predicts the presence of coronary calcification was constructed. The model contains 13 SNPs (from genes AGTR1, ALOX15, INSR, PRKAB1, IL1R2, ESR2, KCNK1, FBLN5, PPARA, VEGFA, PON1, TDRD6, PLA2G7, and one ancestry informative marker) and 5 clinical variables (sex, age, weight, smoking, and diabetes) and achieves 85% predictive accuracy, as measured by area under the ROC curve (AUC). This is a significant (p < 0.001) improvement upon models using just the SNP data or using just the clinical variables.
We present an investigation of joint genetic and clinical factors associated with atherosclerosis that shows predictive results for genetic data alone and for clinical data alone, and enhanced performance for the two combined.
Atherosclerosis is a complex disease with many possible causes, and many adverse clinical outcomes, including heart disease, myocardial infarction, stroke, embolism, thrombosis, and aneurysm. To date, many different genetic1 and clinical factors2 have been identified as causes or contributors to atherosclerosis. We present herein a model of preclinical atherosclerosis based on genetic and clinical data that predicts the presence of coronary artery calcification in healthy Americans of European descent aged 45 to 84.
There has been much progress in identifying genes and single nucleotide polymorphisms (SNPs) associated with atherosclerosis3, or clinical correlates of atherosclerosis (for example, contributing causes like hypertension4 and consequences like myocardial infarction5). There have also been successful predictive models of effects of atherosclerosis, including heart disease6 and thromboembolism7, and these particular models were successful without including genetic information. Additionally, one study considered 30 possible genetic markers8 in a predictive model of atherosclerosis, but concluded that the few SNPs available were unhelpful in predicting atherosclerosis. It seems clear that to achieve a more complete model of atherosclerosis, a multitude of genes and clinical variables must be combined.9
The Multi-Ethnic Study of Atherosclerosis (MESA) recruited a cohort of individuals without clinical evidence for cardiovascular disease (CVD), and then measured coronary artery calcification (CAC), a quantifiable marker of advanced atherosclerosis.10 A large number of candidate genes (231) related to vascular disease and related phenotypes were selected for genotyping, and extensive clinical and demographic data were collected on the subjects. With these combined data, MESA provides an ideal opportunity to investigate both clinical and genetic models of CAC as an indicator of atherosclerosis.
MESA is a population-based study of 6814 men and women aged 45 to 84 years, free of known CVD at baseline, recruited between 2000 and 2002 from six US communities. The main objective of the study is to determine the characteristics of subclinical cardiovascular disease and its progression. Details of the objectives and design of MESA have been published.10 Institutional review board approval was obtained at all MESA sites and all participants gave their informed consent.
For initial genotyping analyses, a subcohort of 720 subjects was selected from the total MESA cohort of 6814. Eligible subjects had given informed consent for DNA extraction and the genetic sub-study, and had samples with sufficient DNA in the study DNA laboratory. All DNA was of high quality as measured by OD260/OD280, with a mean ratio of 1.77. All DNA was of high molecular weight as determined by gel electrophoresis. For the current study, to minimize the possibility of spurious genetic associations due to population stratification, we analyzed the data from the MESA participants self-identified as having European ancestry. Priority was given to subjects who participated in the MESA 3 additional blood biomarker collection, supplemented by random selection from remaining participant samples to fulfill balanced ethnic group representation and equality by gender. CAC was determined with electron beam or helical CT11. The average Agatston score of two scans was used and the presence of CAC was defined as an Agatston score > 0.
DNA was extracted from peripheral leukocytes isolated from packed cells of anticoagulated blood by use of a commercially available DNA isolation kit (Puregene; Gentra Systems, Minneapolis, MN). The DNA was quantified by determination of absorbance at 260 nm followed by PicoGreen analysis (Molecular Probes, Inc., Eugene, OR). Two vials of DNA were stored per participant at −70 degrees centigrade and subsequently aliquoted for use.
MESA investigators proposed candidate genes for two separate gene marker panels (MESA Candidate Gene Panel 1 and 2), and the genes were priority ranked by contributing investigators. The list of genes included is shown in Supplementary Table 1. For Panel 2, additional weight was given to genes proposed by the MESA Eye ancillary study, a study of retinal microvascular characteristics as predictors of subclinical and clinical cardiovascular diseases. Final priority for both panels was assigned by the MESA Family Study Genetics Committee. SNPs for the chosen genes were selected according to the following criteria. First, SNPs within the proximal and distal 10-kilobase regions 5′ and 3′ to the given candidate gene (NCBI Build 35) were chosen. Next, SNP compatibility with the Illumina GoldenGate technology12,13 as determined by the Assay Design Tool (TechSupport, Illumina, San Diego, CA) was required. Finally, we selected SNPs that either had a minor allele frequency (MAF) greater than or equal to 0.05 or were a tag (r² of at least 0.8) for another SNP with MAF > 0.05, as determined by applying the multi-locus or “aggressive” “Tagger” option of Haploview v314,15 using International HapMap project data for CEPH and Yoruban populations (release 19)16. Due to these competing criteria, a complete set of tagSNPs could not be found for some genes, and additional SNPs were selected from one of the following three sources: 1) LDselect analysis of resequencing information from the Seattle SNPs project, if available17,18; 2) non-synonymous SNPs from dbSNP (release 124)19; 3) SNPs with a prior report of association with a phenotype similar or identical to one measured in MESA and proposed by a MESA investigator.
In MESA CG Panel 1, ancestry informative markers (AIMs) were selected from an Illumina proprietary SNP database to maximize the difference in allele frequencies between any pair of ethnic groups: Caucasian- vs African-American; Caucasian- vs Chinese-American; African- vs Chinese-American. For MESA CG Panel 2, additional markers informative for Mexican-American ancestry were selected from published lists20,21.
Genotyping was performed by Illumina Genotyping Services (Illumina Inc., San Diego, CA) using their proprietary GoldenGate assay. The SNPs were typed in two separate panels of 1536 markers, selected to assay multiple phenotype x gene hypotheses. Illumina performed initial quality control in their laboratory to identify samples and SNPs that failed genotyping according to proprietary protocols, and sporadic failed genotypes with gencall quality score <0.25. Of 156 duplicate pairs included in 33 plates of samples typed, Illumina were blinded to 92 pairs. Both unblinded and blinded sample replicate concordance rates were > 99.99%. After removal of failed SNPs and samples, the genotype calling rate was 99.93%, with maximum missing data rate per sample of 2.1%, and maximum missing data per SNP of 4.98%. The cohort genetic data was checked for cryptic sample duplicates and discrepancies in genetically predicted sex (using X markers) versus study database reported sex. Samples with unresolved duplicate and sex discrepancies were removed from the genetic study database.
These criteria resulted in 712 individuals of European descent genotyped at 2882 SNPs. Of this cohort, 393 individuals (55.2%) were previously determined to have detectable coronary artery calcification at the baseline exam. Missing SNP values in the genotype data were imputed by random assignment according to the marginal frequencies of each SNP across the cohort, in an effort to favor the null hypothesis of no association as much as possible. Of the SNPs, 593 were removed for having minor allele frequencies lower than 5%. A further 111 autosomal SNPs were rejected because the distribution of their alleles violated Hardy-Weinberg equilibrium (p<0.05)22. These tests resulted in a total of 2177 SNPs being eligible for our model search algorithms.
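The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function names are hypothetical, genotypes are assumed to be coded as minor-allele counts (0, 1, 2, or `None` when missing), and a one-degree-of-freedom chi-square test is assumed for Hardy-Weinberg equilibrium because the text does not state which test was applied.

```python
import math
import random

def impute_from_marginals(genotypes, rng=None):
    """Fill missing calls (None) by sampling from the observed genotypes,
    i.e. according to the SNP's marginal genotype frequencies."""
    rng = rng or random.Random()
    observed = [g for g in genotypes if g is not None]
    return [g if g is not None else rng.choice(observed) for g in genotypes]

def minor_allele_frequency(genotypes):
    """genotypes: allele counts per person (0, 1, or 2), no missing values."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def hwe_chi_square_p(genotypes):
    """P-value of a 1-df chi-square test against Hardy-Weinberg proportions
    (assumed test; the paper only cites a reference for it)."""
    n = len(genotypes)
    observed = [genotypes.count(k) for k in (0, 1, 2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of counted allele
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, maf_min=0.05, hwe_alpha=0.05):
    """Keep a SNP only if MAF >= 5% and HWE is not rejected at p < 0.05."""
    return (minor_allele_frequency(genotypes) >= maf_min
            and hwe_chi_square_p(genotypes) >= hwe_alpha)
```

Sampling from the observed calls ignores the phenotype entirely, which is why this kind of imputation biases toward the null hypothesis of no association.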
Additional non-genetic data were available for each individual, and we selected the following for inclusion in our model-building process, on the basis of suspected associations with atherosclerosis or CVD: Age, Sex, Education Level, Income, Smoking Status, Weight, Body Mass Index (BMI), Diabetes, LDL Cholesterol, HDL Cholesterol, Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, Hypertension, Walking Speed, and Minutes of Exercise per Week. Income and Education are included as potential proxies for other unmeasured environmental factors. In all categories, responses were binned into at most five discrete values using a simple linear binning strategy (different binning strategies did not affect results).
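As an illustration of the discretization step, the sketch below assumes that "simple linear binning" means equal-width bins spanning the observed range of a variable. The function name and the default of five bins are illustrative choices, not details taken from the study.

```python
def linear_bin(values, n_bins=5):
    """Map each value to an equal-width bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant variable: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `linear_bin([0, 2, 4, 6, 8, 10])` assigns the six values to bins 0, 1, 2, 3, 4, and 4.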
We then sorted all remaining variables by their Bayes factor23, a statistic that assesses the likelihood increase associated with conditioning the outcome (presence of coronary calcification) upon the variable in question. A Bayes factor > 1 (or equivalently, a log Bayes factor > 0) for a particular variable indicates that the variable is more likely to be probabilistically associated with coronary calcification than to be probabilistically independent of it. Bayes factors were computed using conjugate Dirichlet priors as described in Cooper and Herskovits24. We used the Bayes factors to reduce the SNPs to a tractable number: only 50 SNPs had log Bayes factor > 0, and the 17 SNPs with the highest Bayes factors were all on the X chromosome. We also note that ten of the clinical variables had log Bayes factor > 0 (age, sex, diabetes, systolic blood pressure, hypertension, HDL cholesterol, smoking, BMI, total cholesterol, and education). All SNPs and clinical variables with log Bayes factor > 0 are listed in Supplementary Table 2. We used Bayesware Discoverer (www.bayesware.com) to learn a Bayesian network structure on these variables. A Bayesian network is a statistical model whose properties suit it well to the task of discovering predictive models of complex diseases25,26. Among these is the ability of the model marginal likelihood to penalize model complexity without need of further adjustment.27 Indeed, Bayesian networks have proved successful in previous prediction problems of complex diseases, including stroke28. We built three separate Bayesian network models: one using the 50 positive-Bayes factor SNPs alone; another using all 16 clinical variables alone; and a third using the combination of the 50 SNPs and 16 clinical variables.
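The Bayes factor screen can be illustrated with a small sketch. It compares the marginal likelihood of the outcome when it depends on a discrete variable against the marginal likelihood when it does not, using the closed-form Dirichlet-multinomial expression. The uniform prior (`alpha=1.0` per cell) and the function names are assumptions made for illustration; the hyperparameters actually used in the study are not given in the text.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(counts, alpha=1.0):
    """Log marginal likelihood of multinomial counts under a symmetric
    Dirichlet prior with parameter alpha for each outcome state."""
    k, n = len(counts), sum(counts)
    total = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        total += lgamma(alpha + c) - lgamma(alpha)
    return total

def log_bayes_factor(x, y, alpha=1.0):
    """Log Bayes factor of 'y depends on x' versus 'y independent of x'.
    x, y: equal-length sequences of discrete values."""
    y_states = sorted(set(y))
    # Independent model: a single distribution over the outcome.
    overall = Counter(y)
    log_indep = log_marginal_likelihood([overall[s] for s in y_states], alpha)
    # Dependent model: one outcome distribution per value of x.
    log_dep = 0.0
    for x_value in set(x):
        sub = Counter(yi for xi, yi in zip(x, y) if xi == x_value)
        log_dep += log_marginal_likelihood([sub[s] for s in y_states], alpha)
    return log_dep - log_indep
```

A positive return value corresponds to a Bayes factor > 1. Because the dependent model spreads its prior over more parameters, a variable unrelated to the outcome tends to score below zero, which is how the marginal likelihood penalizes complexity without an explicit correction.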
We compared performance of the three different models using the area under the convex hull of the receiver operating characteristic curve (AUC), which measures performance of a classifier at different false positive/false negative thresholds. We then followed the prescription of Lasko et al.29 to compute a p-value comparing the difference of two AUCs, computing the variance of an AUC according to the nonparametric method described by DeLong et al.30
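To make the comparison concrete, the sketch below computes an empirical AUC and a DeLong-style z-test for the difference between two AUCs measured on the same subjects. It is a simplified illustration under stated assumptions: it uses the plain empirical (Mann-Whitney) AUC and not the convex-hull AUC described above, labels are assumed to be coded 1 for CAC present and 0 for absent, and all function names are hypothetical.

```python
import math

def _psi(pos_score, neg_score):
    """Mann-Whitney kernel: 1 if the positive outranks the negative, 0.5 on ties."""
    return 1.0 if pos_score > neg_score else 0.5 if pos_score == neg_score else 0.0

def _components(scores, labels):
    """Per-positive and per-negative structural components (DeLong et al.)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return v10, v01

def auc(scores, labels):
    """Empirical AUC: probability that a positive outranks a negative."""
    v10, _ = _components(scores, labels)
    return sum(v10) / len(v10)

def _cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(scores_a, scores_b, labels):
    """Return (AUC difference, two-sided p-value) for two correlated AUCs."""
    a10, a01 = _components(scores_a, labels)
    b10, b01 = _components(scores_b, labels)
    variance = ((_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / len(a10)
                + (_cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)) / len(a01))
    diff = auc(scores_a, labels) - auc(scores_b, labels)
    if variance == 0:
        return diff, (1.0 if diff == 0 else 0.0)
    z = diff / math.sqrt(variance)
    return diff, math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
```

The covariance terms matter because both models are scored on the same individuals; treating the two AUCs as independent would overstate the variance of their difference.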
Table 1 shows demographic data for the study participants. We developed a Bayesian network using all available data; the Markov blanket of CAC in the network is depicted in Figure 1. It uses all the variables listed in Table 2 except for SNP rs2380316, on the X-chromosome (variable number 18 in Table 2): five clinical variables and 13 SNPs. This model has an 85% AUC on the entire European MESA cohort for predicting CAC presence (Figure 2).
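For readers unfamiliar with the term, the Markov blanket of a node in a Bayesian network consists of its parents, its children, and the other parents of its children; given these, the node is independent of the rest of the network. The sketch below extracts a Markov blanket from a directed acyclic graph. The toy graph is purely hypothetical and does not reproduce the structure shown in Figure 1.

```python
def markov_blanket(parents, target):
    """parents: dict mapping every node to the set of its parents in a DAG.
    Returns the parents, children, and co-parents (spouses) of target."""
    children = {node for node, ps in parents.items() if target in ps}
    blanket = set(parents.get(target, ())) | children
    for child in children:
        blanket |= set(parents[child])  # other parents of each child
    blanket.discard(target)
    return blanket

# Hypothetical toy network, for illustration only.
toy_parents = {
    "CAC": {"age"},
    "snp_1": {"CAC", "sex"},
    "weight": {"sex"},
    "age": set(),
    "sex": set(),
}
blanket = markov_blanket(toy_parents, "CAC")
```

In this toy graph the blanket of `CAC` is `age` (a parent), `snp_1` (a child), and `sex` (a co-parent of `snp_1`); `weight` is excluded because it is connected to `CAC` only through `sex`.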
For comparison purposes, we then obtained a Bayesian network using only the genomic data, and the 14 SNPs used in this network are shown in Table 2 (numbers 6-17,19). The model consists of 14 SNPs directly related to the phenotype (CAC); there are no other connections. This model differs from the first by including an extra SNP on the X chromosome, rs2380316, which may act as a proxy for the absent sex attribute. This model achieved an AUC of 77% (Figure 2). The combined SNP-clinical model's performance was significantly better than that of the SNP-only model (p-value < 0.001), using the method of DeLong et al.30
We then compared our results against a Bayesian network constructed from only the clinical data. This model included the features Sex, Age, Diabetes, Smoking, and Weight (Figure 3). Other attributes did not impact the presence of CAC according to the model. Nevertheless, this model achieved accuracy comparable to that of the SNP-only model: 78.3% AUC (Figure 2). The combined SNP-clinical model performed significantly better than the clinical-only model (p-value < 0.001, DeLong et al.'s method30), whereas the performance of the clinical-only model was not significantly different from that of the SNP-only model. These observations are summarized in Table 3.
Finally, we attempted to construct models of the same types as our Bayesian networks using logistic regression with stepwise selection. A logistic regression model constructed from the 50 SNPs and 16 clinical attributes achieved 81.5% AUC (p = 0.036 vs. the 85% AUC of our Bayesian model), using the attributes age, sex, diabetes, smoking, total cholesterol, and six SNPs (numbers 11 and 16 in Table 2; an X-linked SNP, rs953114; and three SNPs on genes included in the Bayesian model: rs1403543 on AGTR2, rs2498852 at FBLN5, and rs6502804 in ALOX15). A logistic regression model using only clinical variables performed less well, with an AUC of 78.9% (not significantly different from the Bayesian clinical-only model, p = 0.256), but other researchers have previously obtained a logistic model on the entire European MESA cohort (n = 2619) with AUC of 82%2. Logistic regression limited to using SNPs had a predictive accuracy of 66.6% AUC and was significantly worse (p < 1.77 × 10⁻⁷) than the Bayesian SNP-only model. This regression model contained eight SNPs: two X-linked; two from the combined logistic model, rs2498852 at FBLN5 and rs6502804 on ALOX15; and four listed in Table 2, numbers 8, 11, 16, and 17.
Our results show that it is possible to combine genetic data with clinical data and achieve improvements in predictive accuracy. The performance of the combined model (85% AUC) is comparable to or better than that of existing predictors of atherosclerosis. Previous studies have either investigated related but different phenotypes (myocardial infarction, thrombosis, etc.) or had different phenotyping criteria (work by Chen et al.9 checks for a 50% narrowing of any coronary artery – a separate indicator of atherosclerosis, where narrowing of an artery can be due to uncalcified plaques).
We note that multi-attribute methods are required in order to achieve a well-performing predictive model of atherosclerosis. We tested each individual attribute involved as a single predictor of CAC (summarized in Table 2). Individual attributes were poor predictors of CAC, except for Sex and Age; other attributes only achieved AUCs in the range of 53% to 59%. Table 2 also lists p-values for difference between the performance of a single attribute predictor and a random classifier, and 10 of these are significant at a 0.05-threshold.
The network in Figure 1 shows that 13 SNPs, located on 12 genes and one AIM, modulate the risk of coronary artery calcification. Note that although ancestry informative markers were included in the genotyping for purposes related to other investigations, it nevertheless turns out that an AIM is included in the model.
Since the SNPs genotyped in the MESA study were chosen as tag SNPs for candidate genes, and those genes were chosen on the basis of previous knowledge about the genetics of atherosclerosis and cardiovascular disease, the 12 genes in our model represent various connections and relations to atherosclerosis and related physiology. Four of these 12 genes fall into a coherent functional picture related to lipid metabolism, centered on the adipocytokine signaling pathway (PRKAB131, AMPK31, and PPARA32; AMPK, along with INSR, is also related to the insulin signaling pathway33), while the fourth is related to HDL- and LDL-bound cholesterols (PON1).34 Five genes represented in our model could be broadly grouped by their relation to vascular inflammation or constriction (AGTR135, VEGFA36), or general inflammation responses (ALOX1537, IL1R238, and PLA2G739). Several genes have previously been associated with atherosclerosis (PON140, ALOX1541, and FBLN5 and ESR242-44). The twelfth gene (KCNK1) encodes a ubiquitous potassium ion channel and has been linked to heart cells but not vasoconstriction, with differential expression in ventricular and atrial cells.45
Our work here should be considered in light of its scope and limits. First, since our total predictive accuracy is imperfect, we speculate that further genetic investigations will uncover better models. Genome-wide association studies of atherosclerosis, measuring hundreds of thousands of SNPs rather than just thousands, could uncover unknown dependencies between genes in our model, or entirely new genes. Equally, there may be further clinical attributes, beyond those collected in MESA or used in our model, that are relevant to predicting CAC. Second, our work here is limited to the European cohort of the MESA study, while studies on other ethnic groups remain in preparation. Since we have used the MESA data, our prediction is for the presence or absence of coronary artery calcification, which is only one measure of atherosclerosis. It may be easier to predict atherosclerosis or the onset of cardiovascular disease when using different clinical phenotypes, even different CAC thresholds – CAC raw Agatston score > 400, for example, is frequently considered an indicator of increased CVD risk,46 one we did not consider because the number of MESA patients meeting this threshold is low (n = 83). Further, since atherosclerosis affects such a large proportion of the population, and risk is largely dependent on age and sex, it may be more informative to predict who will get atherosclerosis earlier or later than expected, after more data are available for the onset of atherosclerosis. Last but not least, our models have not been tested on an external replication population. Thus, modifications to our model may be necessary to achieve robust prediction on different populations, whether from other studies or from other demographic groups.
This study demonstrates the improved predictive performance of joint consideration of genetic and clinical contributions to a subclinical measure of atherosclerosis. Since the predictive ability of our model is an improvement over the previous best-performing associative tests, this approach has the potential to improve individual-level prediction of CVD risk through the implementation of genetic tests. Further, the model developed in the current study may lead to improved clinical and mechanistic models of atherosclerosis progression.
Pre-symptomatic prediction of late onset diseases is one of the greatest promises of genetic medicine and at the heart of personalized medicine. This study describes a model of preclinical atherosclerosis based on genetic and clinical data. This model predicts the presence of coronary artery calcification with 85% accuracy, as measured by area under the ROC curve. We assessed 712 healthy Americans of European descent aged 45 to 84 for the presence of coronary artery calcification, and their genotypes for 2882 single-nucleotide polymorphisms (SNPs), subtle variations in individuals' genomes. We used a methodology known as Bayesian networks, born at the confluence of statistics and Artificial Intelligence, to integrate these genetic data and relevant clinical information into a coherent network model that predicts the presence of coronary calcification. The model contains 13 SNPs and five clinical variables: sex, age, weight, smoking, and diabetes. This model outperforms both the model built from purely genetic data and the model built on clinical information alone. In this way, this study demonstrates the improved predictive performance of joint consideration of genetic and clinical contributions to a subclinical measure of atherosclerosis. Since the predictive ability of our model is an improvement over the previous best-performing associative tests, this approach has the potential to improve individual risk prediction of atherosclerosis through the development of genetic tests. Further, the model presented in the current study may lead to improved clinical and mechanistic models of atherosclerosis progression and identify novel molecular targets for drug development.
The authors thank the participants, investigators, and staff of the MESA study for their valuable contributions. We thank reviewers for helpful comments on an earlier version of this manuscript. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org.
Funding Sources: This research was supported in part by NIH grants T32-HL-007427, U01-HL-065899, R01-HL-071205, and R21-DA-025168, and by contracts N01-HC-95159 through N01-HC-95165 and N01-HC-95169.