Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population.
Random forests (RF) provide an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities of probands regarding their genotypes. These proximities are then used to impute genotypes of untyped SNPs.
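The proximity-based imputation step can be sketched as follows. This is a minimal illustration, assuming the forest's proximity matrix has already been computed; the function name `impute_from_proximity`, the 0/1/2 genotype coding, and the toy proximity values are hypothetical and not taken from the study.

```python
from collections import defaultdict

def impute_from_proximity(genotypes, proximity, target):
    """Impute the missing genotype of sample `target` by a proximity-weighted vote.

    genotypes: list of genotype codes (0, 1, 2), or None where untyped.
    proximity: square matrix; proximity[i][j] is the fraction of trees in
               which samples i and j end up in the same terminal node.
    """
    votes = defaultdict(float)
    for j, g in enumerate(genotypes):
        if j != target and g is not None:
            votes[g] += proximity[target][j]
    if not votes:
        return None  # no typed neighbours to borrow from
    # Highest total proximity wins; ties resolve to the smaller genotype code.
    return max(sorted(votes), key=lambda g: votes[g])

# Toy example (hypothetical values): sample 0 is untyped at this SNP.
genotypes = [None, 0, 2, 2]
proximity = [
    [1.0, 0.1, 0.6, 0.5],
    [0.1, 1.0, 0.2, 0.1],
    [0.6, 0.2, 1.0, 0.7],
    [0.5, 0.1, 0.7, 1.0],
]
imputed = impute_from_proximity(genotypes, proximity, target=0)
```

In this toy case sample 0 is closest to the two samples carrying genotype 2, so that genotype receives the largest proximity-weighted vote.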
We evaluated this approach using genotype data of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop 16 and the Caucasian HapMap samples as reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.
Using the North American Rheumatoid Arthritis Consortium (NARAC) candidate gene and genome-wide single-nucleotide polymorphism (SNP) data sets, we applied regression methods and tree-based random forests to identify genetic associations with rheumatoid arthritis (RA) and to predict RA disease status. Several genes were consistently identified as weakly associated with RA without a significant interaction or combinatorial effect with other candidate genes. Using random forests, the tested candidate gene SNPs were not sufficient to predict RA patients and normal subjects with high accuracy. However, using the top 500 SNPs, ranked by the importance score, from the genome-wide linkage panel of 5742 SNPs, we were able to accurately predict RA patients and normal subjects with sensitivity of approximately 90% and specificity of approximately 80%, which was confirmed by five-fold cross-validation. However, in a complete training-testing framework, replication of genetic predictors was less satisfactory; thus, further evaluation of existing methodology and development of new methods are warranted.
Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies.
We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for discovering the single-nucleotide polymorphisms (SNPs) that are most predictive of disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively expose information on the interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome these limitations by (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated on simulated data under a wide spectrum of disease models. SNPInterForest identified pure epistatic interactions with high precision and was still capable of concurrently identifying multiple interactions in the presence of genetic heterogeneity. It was also applied to real GWAS data on rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported.
SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.
The CATHGEN study reported associations of chromosome 3q13-21 genes (KALRN, MYLK, CDGAP, and GATA2) with early-onset coronary artery disease (CAD). This study attempted to independently validate those associations. Eleven single nucleotide polymorphisms (SNPs) were examined (rs10934490, rs16834817, rs6810298, rs9289231, rs12637456, rs1444768, rs1444754, rs4234218, rs2335052, rs3803, rs2713604) in patients (N=1,618) from the Intermountain Heart Collaborative Study (IHCS). Given the higher smoking prevalence in CATHGEN than IHCS (41% vs 11% in controls, 74% vs 29% in cases), smoking stratification and genotype-smoking interactions were evaluated. Suggestive association was found for GATA2 (rs2713604, p=0.057, OR=1.2). Among smokers, associations were found in CDGAP (rs10934490, p=0.019, OR=1.6) and KALRN (rs12637456, p=0.011, OR=2.0) and suggestive association in MYLK (rs16834817, p=0.051, OR=1.8, adjusting for gender). No SNP association was found among non-smokers, but smoking/SNP interactions were detected for CDGAP (rs10934490, p=0.017) and KALRN (rs12637456, p=0.010). Similar differences in SNP effects by smoking status were observed on re-analysis of CATHGEN. CAD associations were suggestive for GATA2, and among smokers significant post hoc associations were found in KALRN, MYLK, and CDGAP. Genetic risk conferred by some of these genes may be modified by smoking. Future CAD association studies of these and other genes should evaluate effect modification by smoking.
coronary disease; genetic association; replication study; smoking
Susceptibility variants identified by genome-wide association studies (GWAS) have modest effect sizes. Whether such variants provide incremental information in assessing risk for common 'complex' diseases is unclear. We investigated whether measured and imputed genotypes from a GWAS dataset linked to the electronic medical record alter estimates of coronary heart disease (CHD) risk.
Study participants (n = 1243) had no known cardiovascular disease and were considered to be at high, intermediate, or low 10-year risk of CHD based on the Framingham risk score (FRS), which includes age, sex, total and HDL cholesterol, blood pressure, diabetes, and smoking status. Of eleven SNPs identified in prior GWAS to be associated with CHD, four were genotyped in the participants as part of a GWAS. Genotypes for the remaining seven SNPs were imputed from the HapMap CEU population using the program MACH. We calculated a multiplex genetic risk score for each patient based on the odds ratios of the susceptibility SNPs and incorporated this into the FRS.
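One common way to build an odds-ratio-based score of this kind is to weight each risk-allele count by the logarithm of its odds ratio. The sketch below assumes that weighting; the function names, the odds ratios, and the allele counts are hypothetical, and the step that combines the score with the FRS is not shown because the text does not specify it.

```python
import math

def risk_allele_count(counts):
    """Unweighted score: total number of risk alleles carried (0-2 per SNP)."""
    return sum(counts)

def weighted_risk_score(counts, odds_ratios):
    """Weighted score: each risk-allele count is weighted by log(odds ratio)."""
    if len(counts) != len(odds_ratios):
        raise ValueError("one odds ratio is required per SNP")
    return sum(c * math.log(o) for c, o in zip(counts, odds_ratios))

# Hypothetical per-allele odds ratios and one participant's risk-allele counts.
odds_ratios = [1.20, 1.10, 1.30]
counts = [2, 0, 1]
unweighted = risk_allele_count(counts)
weighted = weighted_risk_score(counts, odds_ratios)
```

Log-odds weights make the per-SNP contributions additive on the log-odds scale, so a SNP with a larger effect moves the score more than one with an odds ratio near 1.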
The mean (SD) number of risk alleles was 12.31 (1.95), range 6-18. The mean (SD) of the weighted genetic risk score was 12.64 (2.05), range 5.75-18.20. The CHD genetic risk score was not correlated with the FRS (P = 0.78). After incorporating the genetic risk score into the FRS, a total of 380 individuals (30.6%) were reclassified into higher-risk (188) or lower-risk (192) groups.
A genetic risk score based on measured/imputed genotypes at 11 susceptibility SNPs led to significant reclassification in the 10-year CHD risk categories. Additional prospective studies are needed to assess the accuracy and clinical utility of such reclassification.
Random forest is an efficient approach for investigating not only the effects of individual markers on a trait but also the effect of the interactions among the markers in genetic association studies. This approach is especially appealing for the analysis of genome-wide data, such as those obtained from gene expression/single-nucleotide polymorphism (SNP) array experiments in which the number of candidate genes/SNPs is vast. We applied this approach to the Genetic Analysis Workshop 16 Problem 1 data to identify SNPs that contribute to rheumatoid arthritis. The random forest computed a raw importance score for each SNP marker, where higher importance score suggests higher level of association between the marker and the trait. The significance level of the association was determined empirically by repeatedly reapplying the random forest on randomly generated data under the null hypothesis that no association exists between the markers and the trait. Using random forest, we were able to identify 228 significant SNPs (at the genome-wide significant level of 0.05) across the whole genome, over two-thirds of which are located on chromosome 6, especially clustered in the region of 6p21 containing the human leukocyte antigen (HLA) genes, such as gene HLA-DRB1 and HLA-DRA. Further analysis of this region indicates a strong association to the rheumatoid arthritis status.
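The empirical significance step can be sketched in a few lines. This is a minimal illustration that assumes the importance scores from the null runs are already available; taking the per-run maximum importance to set a genome-wide cutoff is an assumption of this sketch, not a detail stated above, and both function names are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null importance scores at least as large as the observed one.

    The +1 terms count the observed score itself, so the estimate is never 0.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

def genome_wide_threshold(null_max_scores, alpha=0.05):
    """Cutoff exceeded by the per-run maximum in at most `alpha` of the null runs."""
    ranked = sorted(null_max_scores, reverse=True)
    k = int(alpha * len(ranked))  # number of null runs allowed above the cutoff
    return ranked[k]

# Hypothetical importance scores for one SNP across five null runs.
null_scores = [0.1, 0.4, 0.2, 0.9, 0.3]
p = empirical_p_value(0.5, null_scores)
```

A SNP whose observed importance exceeds the value returned by `genome_wide_threshold` would be called significant at level `alpha` under this scheme.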
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
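For reference, the univariate baseline in this comparison can be sketched as a two-sided Fisher exact test on a 2x2 table per SNP, followed by ranking on p-value. The collapse of genotypes to a 2x2 table, the SNP names, and the counts below are hypothetical choices for illustration only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

def screen_snps(tables, keep):
    """Return the names of the `keep` SNPs with the smallest p-values."""
    pvals = {name: fisher_exact_two_sided(*t) for name, t in tables.items()}
    return sorted(pvals, key=pvals.get)[:keep]

# Hypothetical counts: (case carriers, case non-carriers,
#                       control carriers, control non-carriers).
tables = {"snp_a": (8, 2, 1, 5), "snp_b": (3, 1, 1, 3), "snp_c": (5, 5, 5, 5)}
retained = screen_snps(tables, keep=1)
```

Because each SNP is tested on its own, a SNP whose effect appears only in combination with another SNP can have an unremarkable table and be dropped by such a screen.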
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
The aim of this study was to detect the effect of interactions between single-nucleotide polymorphisms (SNPs) on incidence of heart diseases. For this purpose, 2912 subjects with 350,160 SNPs from the Framingham Heart Study (FHS) were analyzed. PLINK was used to control quality and to select the 10,000 most significant SNPs. A classification tree algorithm, Generalized, Unbiased, Interaction Detection and Estimation (GUIDE), was employed to build a classification tree to detect SNP-by-SNP interactions for the selected 10 k SNPs. The classes generated by GUIDE were reexamined by a generalized estimating equations (GEE) model with the empirical variance after accounting for potential familial correlation. Overall, 17 classes were generated based on the splitting criteria in GUIDE. The prevalence of coronary heart disease (CHD) in class 16 (determined by SNPs rs1894035, rs7955732, rs2212596, and rs1417507) was the lowest (0.23%). Compared to class 16, all other classes except for class 288 (prevalence of 1.2%) had a significantly greater risk when analyzed using GEE model. This suggests the interactions of SNPs on these node paths are significant.
We sought to examine whether ε4 carrier status modifies the relation between body mass index (BMI) and HDL. The National Heart, Lung, and Blood Institute Family Heart Study included 657 families with high family risk scores for coronary heart disease and 588 randomly selected families of probands in the Framingham, Atherosclerosis Risk in Communities, and Utah Family Health Tree studies. We selected 1402 subjects who had ε4 carrier status available. We used generalized estimating equations to examine the interaction between BMI and ε4 allele carrier status on HDL after adjusting for age, gender, smoking, alcohol intake, mono- and poly-unsaturated fat intake, exercise, comorbidities, LDL, and family cluster.
The mean (standard deviation) age of included subjects was 56.4 (11.0) years and 47% were male. Adjusted means of HDL for normal, overweight, and obese BMI categories were 51.2 (± 0.97), 45.0 (± 0.75), and 41.6 (± 0.93), respectively, among 397 ε4 carriers (p-value for trend < 0.0001) and 53.6 (± 0.62), 51.3 (± 0.49), and 45.0 (± 0.62), respectively, among 1005 non-carriers of the ε4 allele (p-value for trend < 0.0001). There was no evidence for an interaction between BMI and ε4 status on HDL (p-value 0.39).
Our findings do not support an interaction between ε4 allele status and BMI on HDL.
HDL cholesterol; body mass index; genetic epidemiology; apolipoproteins; lipid metabolism; adiposity
The contribution of common genetic variation to one or more established smoking behaviors was investigated in a joint analysis of two genome-wide association studies (GWAS) performed as part of the Cancer Genetic Markers of Susceptibility (CGEMS) project in 2,329 men from the Prostate, Lung, Colon and Ovarian (PLCO) Trial, and 2,282 women from the Nurses' Health Study (NHS). We analyzed seven measures of smoking behavior, four continuous (cigarettes per day [CPD], age at initiation of smoking, duration of smoking, and pack years), and three binary (ever versus never smoking, ≤10 versus >10 cigarettes per day [CPDBI], and current versus former smoking). Association testing for each single nucleotide polymorphism (SNP) was conducted by study and adjusted for age, cohabitation/marital status, education, site, and principal components of population substructure. None of the SNPs achieved genome-wide significance (p<10^-7) in any combined analysis pooling evidence for association across the two studies; we observed between two and seven SNPs with p<10^-5 for each of the seven measures. In the chr15q25.1 region spanning the nicotinic receptors CHRNA3 and CHRNA5, we identified multiple SNPs associated with CPD (p<10^-3), including rs1051730, which has been associated with nicotine dependence, smoking intensity and lung cancer risk. In parallel, we selected 11,199 SNPs drawn from 359 a priori candidate genes and performed individual-gene and gene-group analyses. After adjusting for multiple tests conducted within each gene, we identified between two and five genes associated with each measure of smoking behavior. Besides CHRNA3 and CHRNA5, MAOA was associated with CPDBI (gene-level p<5.4×10^-5). Our analysis provides independent replication of the association between the chr15q25.1 region and smoking intensity, and data for multiple other loci associated with smoking behavior that merit further follow-up.
The key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases notwithstanding, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have been successful on small-scale case-control data, the "combination explosion" curse prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that are able to reduce the search space for epistatic interactions from an astronomical number of all possible combinations of genetic variants to a manageable set of candidates.
We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases from controls. On the basis of the gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that could minimize the classification error and then statistically tested up to three-way interactions of the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, and sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that were reported to be associated with this disease.
Besides existing pure statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The gini importance offers yet another measure for the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features such as single nucleotide polymorphisms (SNPs) in GWAS that are associated with a disease. Random forests represent a very useful approach for this purpose, using a variable importance score. This importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.
Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared to a univariate test and to the permutation test using the Gini and permutation importance. In simulation, the proposed method consistently outperformed the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs by utilizing conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates the understanding of the etiology between genetic variants and complex diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
Socioeconomic status (SES) predicts coronary heart disease independently of the traditional risk factors included in the Framingham risk score. However, it is unknown whether changes in Framingham risk score variables over time explain the association between SES and coronary heart disease. We examined this question given its relevance to risk assessment in clinical decision making.
The Atherosclerosis Risk in Communities study data (initiated in 1987 with 10-year follow-up of 15,495 adults aged 45-64 years in four Southern and Mid-Western communities) were used. SES was assessed at baseline, dichotomized as low SES (defined as low education and/or low income) or not. The time-dependent variables (smoking, total and high density lipoprotein cholesterol, systolic blood pressure and use of blood pressure lowering medication) were assessed every three years. Ten-year incidence of coronary heart disease was based on EKG and cardiac enzyme criteria, or adjudicated death certificate data. Cox survival analyses examined the contribution of SES to heart disease risk independent of baseline Framingham risk score, without and with further adjustment for the time-dependent variables.
Adjusting for baseline Framingham risk score, low SES was associated with an increased coronary heart disease risk (hazard ratio [HR] = 1.53; 95% Confidence Interval [CI], 1.27 to 1.85). After further adjustment for the time-dependent variables, the SES effect remained significant (HR = 1.44; 95% CI, 1.19 to 1.74).
Using the Framingham risk score alone underestimated the coronary heart disease risk in low SES persons. This bias was not eliminated by subsequent changes in Framingham risk score variables.
coronary disease; cholesterol; epidemiology; prevention; risk factors
Aim: Genome-wide association studies have identified variants on chromosome 9p21 that are associated with coronary heart disease (CHD). The relationship between these variants and the age of onset of CHD is less clear. The aim of this study was to examine the allelic frequencies and haplotype structure of eight single-nucleotide polymorphisms (SNPs) on chromosome 9p21 in ethnically diverse women. We also explored the relationship between 9p21 SNPs and the age of CHD onset. Results: There was considerable interethnic allelic and haplotype diversity across the 9p21 locus with only two SNPs (rs10757274 and rs4977574) in perfect linkage disequilibrium in both races, and only a small proportion of the haplotypes shared between the racial groups. With the exception of rs1333040, whites with at least one copy of the 9p21 SNP risk alleles were found to have CHD from 1.45 (rs10116277) to 4.77 (rs2383206) years earlier than those with the wild-type alleles. Blacks carrying at least one copy of the risk allele (92%) for rs1333040 had a CHD age of onset that was 6.5 years earlier than those with the wild-type alleles. Conclusions: Different variants on chromosome 9p21 may influence CHD age of onset in whites and blacks.
Comparison of patients with coronary heart disease and controls in genome-wide association studies has revealed several single nucleotide polymorphisms (SNPs) associated with coronary heart disease. We aimed to establish the external validity of these findings and to obtain more precise risk estimates using a prospective cohort design.
We tested 13 recently discovered SNPs for association with coronary heart disease in a case-control design including participants differing from those in the discovery samples (3829 participants with prevalent coronary heart disease and 48 897 controls free of the disease) and a prospective cohort design including 30 725 participants free of cardiovascular disease from Finland and Sweden. We modelled the 13 SNPs as a multilocus genetic risk score and used Cox proportional hazards models to estimate the association of genetic risk score with incident coronary heart disease. For case-control analyses we analysed associations between individual SNPs and quintiles of genetic risk score using logistic regression.
In prospective cohort analyses, 1264 participants had a first coronary heart disease event during a median 10·7 years' follow-up (IQR 6·7–13·6). Genetic risk score was associated with a first coronary heart disease event. When compared with the bottom quintile of genetic risk score, participants in the top quintile were at 1·66-times increased risk of coronary heart disease in a model adjusting for traditional risk factors (95% CI 1·35–2·04, p value for linear trend=7·3×10^-10). Adjustment for family history did not change these estimates. Genetic risk score did not improve C index over traditional risk factors and family history (p=0·19), nor did it have a significant effect on net reclassification improvement (2·2%, p=0·18); however, it did have a small effect on integrated discrimination index (0·004, p=0·0006). Results of the case-control analyses were similar to those of the prospective cohort analyses.
Using a genetic risk score based on 13 SNPs associated with coronary heart disease, we can identify the 20% of individuals of European ancestry who are at roughly 70% increased risk of a first coronary heart disease event. The potential clinical use of this panel of SNPs remains to be defined.
The Wellcome Trust; Academy of Finland Center of Excellence for Complex Disease Genetics; US National Institutes of Health; the Donovan Family Foundation.
Cardiovascular disease (CVD) and its most common manifestations – including coronary heart disease (CHD), stroke, heart failure (HF), and atrial fibrillation (AF) – are major causes of morbidity and mortality. In many industrialized countries, CVD claims more lives each year than any other disease. Heart disease and stroke are the first and third leading causes of death in the United States. Prior investigations have reported several single gene variants associated with CHD, stroke, HF, and AF. We report a community-based genome-wide association study of major CVD outcomes.
In 1345 Framingham Heart Study participants from the largest 310 pedigrees (54% women, mean age 33 years at entry), we analyzed associations of 70,987 qualifying SNPs (Affymetrix 100K GeneChip) to four major CVD outcomes: major atherosclerotic CVD (n = 142; myocardial infarction, stroke, CHD death), major CHD (n = 118; myocardial infarction, CHD death), AF (n = 151), and HF (n = 73). Participants free of the condition at entry were included in proportional hazards models. We analyzed model-based deviance residuals using generalized estimating equations to test associations between SNP genotypes and traits in additive genetic models restricted to autosomal SNPs with minor allele frequency ≥0.10, genotype call rate ≥0.80, and Hardy-Weinberg equilibrium p-value ≥ 0.001.
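The three SNP filters named above can be sketched as a single check on one SNP's genotype counts. This is a rough illustration, assuming a Pearson chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium and ignoring family structure; the study's actual HWE test is not stated, and the function name and argument layout are hypothetical.

```python
import math

def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  min_maf=0.10, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply MAF, call-rate, and HWE filters to one SNP's genotype counts."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * called)  # frequency of allele A
    maf = min(p, 1 - p)
    if maf == 0:
        return False  # monomorphic SNP
    # Pearson chi-square of observed counts against HWE expectations (1 df).
    expected = (called * p * p, 2 * called * p * (1 - p), called * (1 - p) ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return maf >= min_maf and call_rate >= min_call_rate and hwe_p >= min_hwe_p
```

For example, counts of 25/50/25 with no missing calls pass all three filters, whereas 50/0/50 fails the HWE filter because heterozygotes are entirely absent.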
Six associations yielded p < 10^-5. The lowest p-values for each CVD trait were as follows: major CVD, rs499818, p = 6.6 × 10^-6; major CHD, rs2549513, p = 9.7 × 10^-6; AF, rs958546, p = 4.8 × 10^-6; HF, rs740363, p = 8.8 × 10^-6. Of note, we found associations of a 13 kb region on chromosome 9p21 with major CVD (p = 1.7–1.9 × 10^-5) and major CHD (p = 2.5–3.5 × 10^-4) that confirm associations with CHD in two recently reported genome-wide association studies. Also, rs10501920 in CNTN5 was associated with AF (p = 9.4 × 10^-6) and HF (p = 1.2 × 10^-4). Complete results for these phenotypes can be found at the dbGaP website.
No association attained genome-wide significance, but several intriguing findings emerged. Notably, we replicated associations of chromosome 9p21 with major CVD. Additional studies are needed to validate these results. Finding genetic variants associated with CVD may point to novel disease pathways and identify potential targeted preventive therapies.
Previous studies have suggested a positive association between phenotypes of fucosyltransferase 3 (FUT3) gene (also known as Lewis gene) and coronary heart disease.
We used data on 1,735 unrelated subjects in the Framingham Offspring Study to assess whether 3 functional single nucleotide polymorphisms (SNPs) of the FUT3 gene (T59G, T1067A, and T202C) were associated with prevalent atherothrombotic disease.
Contrary to T1067A and T202C SNPs, there was evidence for an association between T59G SNP and atherothrombotic disease prevalence. In a multivariable model controlling for age, sex, alcohol intake, pack-years of smoking, ratio of total-to-HDL-cholesterol, and diabetes mellitus, odds ratios (95% CI) for prevalent atherothrombotic disease were 1.0 (reference), 0.80 (0.46-1.41), and 6.70 (1.95-23.01) for TT, TG, and GG genotypes of the T59G SNP, respectively. Minor alleles of T202C and T1067A SNPs showed a modest and non-significant association with atherothrombotic disease. Overall, FUT3 polymorphism that influences the enzyme activity (GG genotype for T59G or ≥ 1 minor allele of T202C or T1067A) was associated with increased atherothrombotic disease prevalence [OR: 1.57 (1.05-2.34)] and this association was stronger among abstainers (2-fold increased odds) than among current drinkers (p for interaction 0.11).
Our data suggest that functional mutations of the FUT3 gene may be associated with an increased atherothrombotic disease prevalence, especially among abstainers. Additional studies are warranted to confirm these findings.
Cardiovascular disease; FUT3 gene; epidemiology; genetics
Cardiovascular disease is the leading cause of death among Americans. Inflammation is a hallmark of the development of atherosclerosis and is mediated by prostaglandins, whose synthesis is catalyzed by cyclooxygenase (COX)-2. We sought to determine whether variants in the COX-2 gene were associated with subclinical measures of cardiovascular disease in a primarily type 2 diabetic population.
Eight polymorphisms in COX-2 were genotyped and vascular calcified plaque measured in the coronary, carotid, and aortic arterial beds in 977 Caucasian siblings (83% with T2DM) from 369 Diabetes Heart Study families. Tests for single SNP and haplotypic association were performed using SOLAR and QPDT, respectively (results adjusted for age, gender, diabetes affection status, smoking, and use of lipid altering medications).
All eight SNPs genotyped were found to be in strong pair-wise linkage disequilibrium (D′=1.0). Three SNPs (rs689466, rs2066826 and rs20417) were associated with either coronary or carotid calcified plaque. Subjects carrying the G allele of rs689466 (n=31) or the A allele of rs2066826 (n=16) had significantly lower coronary calcified plaque (p=0.02 and 0.04, respectively). Subjects homozygous for the C allele of rs20417 (n=22) or the A allele of rs2066826 (n=16) had increased carotid calcified plaque (p=0.011 and 0.014, respectively). In addition, multiple 2-SNP and 3-SNP haplotypes were associated with coronary calcified plaque, with p-values ranging from 0.002 to 0.035.
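For readers unfamiliar with the D′ statistic, the sketch below computes D′ and r² for two biallelic loci from haplotype and allele frequencies using the standard definitions. The function name and the frequencies are hypothetical, and both loci are assumed to be polymorphic.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D', r^2) given the A-B haplotype frequency and allele frequencies.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2).
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Hypothetical haplotype frequencies: AB = 0.2, Ab = 0.0, aB = 0.3, ab = 0.5.
d_prime, r2 = linkage_disequilibrium(p_ab=0.2, p_a=0.2, p_b=0.5)
```

In this toy case one of the four possible haplotypes is absent, so D′ equals 1 even though r² is only 0.25; a D′ of 1.0 therefore does not by itself mean that two SNPs are interchangeable.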
Polymorphisms in COX-2 were associated with significant changes in coronary and carotid calcified plaque. Diabetic individuals with these variants may be at higher risk for developing cardiovascular disease.
Type 2 Diabetes Mellitus; Cardiovascular Disease; Polymorphisms; Inflammation; Vascular Calcification; Cyclooxygenase
There is an increasing incidence of esophageal adenocarcinoma (EA) among younger people in the western populations. However, the association between genetic polymorphisms and the age of EA onset is unclear. In this study, 1330 functional/tagging single-nucleotide polymorphisms (SNPs) from 354 cancer-related genes were genotyped in 335 white EA patients. Twenty important SNPs that have the highest importance scores and lowest classification error rate were identified by the random forest algorithm to be associated with early onset of EA (age ≤ 55 years). Subsequent logistic regression analysis indicated that 10 SNPs (rs2070744 of NOS3, rs720321 of BCL2, rs17757541 of BCL2, rs11775256 of TNFRSF10A, rs1035142 of CASP8, rs2236302 of MMP14, rs4740363 of ABL1, rs696217 of GHRL, rs2445762 of CYP19A1, and rs11941492 of VEGFR2/KDR) were significantly associated with early onset of EA (≤55 vs >55 years, all P < .05 after adjusting for co-variates and false discovery rate). Among them, five SNPs in the NOS3, BCL2, TNFRSF10A, and CASP8 genes were known to be involved in apoptosis processes. In Kaplan-Meier analyses, rs2070744 of NOS3, rs720321 of BCL2, and rs1035142 of CASP8 were also significantly associated with early onset of EA. Moreover, there was a higher risk of developing EA at a younger age when one had more risk genotypes. In conclusion, polymorphisms in cancer-related genes, especially those in the apoptotic pathway, play an important role in the development of younger-aged EA in a dose-response manner.
Personalized health-care promises tailored health-care solutions to individual patients based on their genetic background and/or environmental exposure history. To date, disease prediction has been based on a few environmental factors and/or single nucleotide polymorphisms (SNPs), while complex diseases are usually affected by many genetic and environmental factors with each factor contributing a small portion to the outcome. We hypothesized that the use of random forests classifiers to select SNPs would result in an improved predictive model of asthma exacerbations. We tested this hypothesis in a population of childhood asthmatics.
In this study, using emergency room visits or hospitalizations as the definition of a severe asthma exacerbation, we first identified a list of top genome-wide association study (GWAS) SNPs ranked by random forests (RF) importance score for the CAMP (Childhood Asthma Management Program) population of 127 exacerbation cases and 290 non-exacerbation controls. We then predicted severe asthma exacerbations using the top 10 to 320 SNPs together with age, sex, pre-bronchodilator FEV1 percentage predicted, and treatment group.
Testing in an independent set of the CAMP population shows that severe asthma exacerbations can be predicted with an area under the curve (AUC) of 0.66 using 160-320 SNPs, compared with an AUC of 0.57 using 10 SNPs. Using the clinical traits alone yielded an AUC of 0.54, suggesting that the phenotype is affected by genetic as well as environmental factors.
Our study shows that a random forests algorithm can effectively extract and use the information contained in a small number of samples. Random forests and other machine learning tools can be used with GWAS to integrate large numbers of predictors simultaneously.
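The AUC values reported above can be read as the probability that a randomly chosen exacerbation case receives a higher predicted score than a randomly chosen control. A minimal sketch of that computation follows; the function name `roc_auc` and the 1/0 label coding are assumptions for illustration, and the study's own evaluation code is not described in the abstract.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for cases, 0 for controls (hypothetical coding).
    scores: predicted risk, higher meaning more likely a case.
    Tied pairs count as half a correct ranking.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

An AUC of 0.5 corresponds to random ranking, which is why the 0.54 obtained from clinical traits alone is close to uninformative.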
Rationale: Previously reported linkage to FEV1 (LOD score = 5.0) on 6q27 in the Framingham Heart Study (FHS) led us to explore a candidate gene, SMOC2, at 168.6 Mb.
Objectives: We tested association between SMOC2 polymorphisms and FEV1 and FVC in unrelated FHS participants.
Methods: Twenty single-nucleotide polymorphisms (SNPs) around SMOC2 were genotyped in 1,734 subjects.
Measurements and Main Results: SNP data were analyzed using multiple linear regression models incorporating sex, age, body mass index, height, and smoking history as covariates, and analyses were repeated within strata of ever- and never-smokers. The minor allele of SNP rs1402 was associated with higher mean FEV1 (p = 0.003) and FVC (p = 0.02) measures. In never-smoking subjects, association with higher measures was observed with the minor allele of rs747995 (FEV1, p = 0.0006; FVC, p = 0.0008). These two SNPs lie in different haplotype blocks and reside in intron 4 of SMOC2. Haplotype analysis revealed a common G-T haplotype (rs747995–rs1402) with 77% frequency in never-smoking FHS subjects. The G-T haplotype was associated with a reduction of 126 ml in FEV1 (p = 0.0002) and 157 ml in FVC (p = 0.0002). The G-T haplotype was similarly associated in a set of never-smoking subjects from the Family Heart Study (FEV1, p = 0.03; FVC, p = 0.03).
Conclusions: The replication of the association in two populations supports the possibility that SMOC2 might play an important role in the determination of FEV1 and FVC.
FEV1; FVC; genetics; single-nucleotide polymorphism
The adipocyte-derived protein adiponectin is highly heritable and inversely associated with risk of type 2 diabetes mellitus (T2D) and coronary heart disease (CHD). We meta-analyzed 3 genome-wide association studies for circulating adiponectin levels (n = 8,531) and sought validation of the lead single nucleotide polymorphisms (SNPs) in 5 additional cohorts (n = 6,202). Five SNPs were genome-wide significant in their relationship with adiponectin (P ≤ 5×10⁻⁸). We then tested whether these 5 SNPs were associated with risk of T2D and CHD using a Bonferroni-corrected threshold of P ≤ 0.011 to declare statistical significance for these disease associations. SNPs at the adiponectin-encoding ADIPOQ locus demonstrated the strongest associations with adiponectin levels (P-combined = 9.2×10⁻¹⁹ for lead SNP, rs266717, n = 14,733). A novel variant in the ARL15 (ADP-ribosylation factor-like 15) gene was associated with lower circulating levels of adiponectin (rs4311394-G, P-combined = 2.9×10⁻⁸, n = 14,733). This same risk allele at ARL15 was also associated with a higher risk of CHD (odds ratio [OR] = 1.12, P = 8.5×10⁻⁶, n = 22,421) and, more nominally, with an increased risk of T2D (OR = 1.11, P = 3.2×10⁻³, n = 10,128) and several metabolic traits. Expression studies in humans indicated that ARL15 is well-expressed in skeletal muscle. These findings identify a novel protein, ARL15, which influences circulating adiponectin levels and may impact upon CHD risk.
Through a meta-analysis of genome-wide association studies of 14,733 individuals, we identified common base-pair variants in the genome which influence circulating adiponectin levels. Since adiponectin is an adipocyte-derived circulating protein which has been inversely associated with risk of obesity-related diseases such as type 2 diabetes (T2D) and coronary heart disease (CHD), we next sought to understand if the identified variants influencing adiponectin levels also influence risk of T2D, CHD, and several metabolic traits. In addition to confirming that variation at the ADIPOQ locus influences adiponectin levels, our analyses point to a variant in the ARL15 (ADP-ribosylation factor-like 15) locus which decreases adiponectin levels and increases risk of CHD and T2D. Further, this same variant was associated with increased fasting insulin levels and glycated hemoglobin. While the function of ARL15 is not known, we provide insight into the tissue specificity of ARL15 expression. These results thus provide novel insights into the physiology of the adiponectin pathway and obesity-related diseases.
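The abstract above reports combined P values across cohorts without stating how the per-cohort estimates were pooled. As an illustration only, the sketch below assumes a fixed-effect inverse-variance meta-analysis of per-cohort effect estimates; the function name `fixed_effect_meta` and its inputs are hypothetical and not taken from the study.

```python
import math


def fixed_effect_meta(betas, std_errors):
    """Pool per-cohort effect estimates with inverse-variance weights.

    Assumption: fixed-effect model (the abstract does not name the
    pooling method). Returns (pooled beta, pooled SE, two-sided P).
    """
    if not betas or len(betas) != len(std_errors):
        raise ValueError("betas and std_errors must be non-empty and aligned")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Two-sided P value from the standard normal distribution.
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value
```

Pooling two equally precise cohorts with the same effect estimate leaves the estimate unchanged and shrinks its standard error by a factor of √2, which is how modest per-cohort signals can reach genome-wide significance in combination.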
Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case–cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes fewer than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (fewer than 150 SNPs) and LDL (fewer than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
coronary heart disease; genome-wide association studies; Random Forests classifier; SNPs; variable selection
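The general idea of a wrapper strategy, in which a classifier's accuracy drives the search over SNP subsets, can be sketched as a greedy backward elimination. This is not the SIZEFIT or FIXFIT algorithm described above; the function `backward_elimination`, the callback `evaluate` (standing in for, say, a cross-validated Random Forests accuracy), and the `tolerance` parameter are all hypothetical.

```python
def backward_elimination(features, evaluate, tolerance=0.0):
    """Greedily drop features while accuracy stays near the full-set score.

    features: list of feature (SNP) identifiers.
    evaluate: hypothetical callback mapping a feature list to a score
              (higher is better), e.g. cross-validated accuracy.
    tolerance: largest acceptable drop below the full-set score.
    """
    selected = list(features)
    baseline = evaluate(selected)
    removed_one = True
    while removed_one and len(selected) > 1:
        removed_one = False
        best_feature, best_score = None, None
        # Find the single feature whose removal hurts the score least.
        for f in selected:
            trial = [g for g in selected if g != f]
            score = evaluate(trial)
            if best_score is None or score > best_score:
                best_feature, best_score = f, score
        # Remove it only if the score remains within tolerance.
        if best_score >= baseline - tolerance:
            selected.remove(best_feature)
            removed_one = True
    return selected


# Toy usage: only two hypothetical SNPs carry signal.
informative = {"snp_a", "snp_b"}
toy_score = lambda subset: len(informative & set(subset)) / len(informative)
kept = backward_elimination(["snp_a", "snp_b", "snp_c", "snp_d"], toy_score)
```

Each pass re-evaluates the classifier once per remaining feature, so the cost grows quadratically with the number of SNPs; this is one reason local search heuristics are attractive for datasets with thousands of markers.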
The complement factor H (CFH) gene has recently been confirmed to play an essential role in the development of age-related macular degeneration (AMD). There are conflicting reports of its role in coronary heart disease. This study was designed to investigate, using a family-based approach, whether genetic variants of the CFH gene are associated with the risk of early-onset coronary heart disease.
We evaluated 6 SNPs and 5 common haplotypes in the CFH gene amongst 1494 individuals in 580 Irish families with at least one member prematurely affected with coronary heart disease. Genotypes were determined by multiplex SNaPshot technology.
Using the TDT/S-TDT test, we did not find an association between any of the individual SNPs or any of the 5 haplotypes and early-onset coronary heart disease.
In this family-based study, we found no association between the CFH gene and early-onset coronary heart disease.
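For readers unfamiliar with the transmission disequilibrium test, the basic statistic can be sketched as follows. This covers only the classical TDT for a biallelic marker, not the sibship-based S-TDT or the combined test used in the study; the function name `tdt` and its arguments are hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """Classical TDT for one allele of a biallelic marker.

    transmitted: times heterozygous parents passed the allele to an
                 affected child.
    untransmitted: times heterozygous parents passed the other allele.
    Returns (chi-square statistic with 1 df, P value).
    """
    b, c = transmitted, untransmitted
    if b < 0 or c < 0 or b + c == 0:
        raise ValueError("need non-negative counts with at least one transmission")
    # McNemar-type statistic: under no linkage/association, b and c
    # are expected to be equal.
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Only heterozygous parents contribute, which makes the test robust to population stratification; a null result, as reported above, corresponds to roughly equal transmitted and untransmitted counts.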