1.  Evaluation of single-nucleotide polymorphism imputation using random forests 
BMC Proceedings  2009;3(Suppl 7):S65.
Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population.
Random forests (RF) provide an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities of probands regarding their genotypes. These proximities are then used to impute genotypes of untyped SNPs.
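The proximity-based step described above can be sketched as a weighted vote. This is a minimal illustration and not the authors' implementation: it assumes the proximities from the forest are already available, and the function name `impute_genotype`, the 0/1/2 genotype coding, and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict


def impute_genotype(proximities, genotypes):
    """Return the proximity-weighted majority genotype among typed individuals.

    proximities: proximity of the target proband to each reference individual
    genotypes:   genotype code (assumed 0, 1 or 2) of each reference individual,
                 or None where that individual is untyped at this SNP
    """
    votes = defaultdict(float)
    for prox, geno in zip(proximities, genotypes):
        if geno is not None:
            votes[geno] += prox  # each typed individual votes with its proximity
    if not votes:
        raise ValueError("no typed reference individuals")
    # Highest total proximity wins; ties go to the smaller genotype code (assumption).
    return min(votes, key=lambda g: (-votes[g], g))


# Hypothetical example: two close neighbours carry genotype 2, one distant one carries 0.
imputed = impute_genotype([0.9, 0.1, 0.6, 0.3], [2, 0, 2, None])
```

In this toy example genotype 2 collects a total proximity of 1.5 against 0.1 for genotype 0, so the untyped SNP is imputed as 2.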
We evaluated this approach using genotype data of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop 16 and the Caucasian HapMap samples as reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.
PMCID: PMC2795966  PMID: 20018059
2.  Validation Study of Genetic Associations with Coronary Artery Disease on Chromosome 3q13-21 and Potential Effect Modification by Smoking 
Annals of human genetics  2009;73(Pt 6):551-558.
The CATHGEN study reported associations of chromosome 3q13-21 genes (KALRN, MYLK, CDGAP, and GATA2) with early-onset coronary artery disease (CAD). This study attempted to independently validate those associations. Eleven single nucleotide polymorphisms (SNPs) were examined (rs10934490, rs16834817, rs6810298, rs9289231, rs12637456, rs1444768, rs1444754, rs4234218, rs2335052, rs3803, rs2713604) in patients (N=1,618) from the Intermountain Heart Collaborative Study (IHCS). Given the higher smoking prevalence in CATHGEN than IHCS (41% vs 11% in controls, 74% vs 29% in cases), smoking stratification and genotype-smoking interactions were evaluated. Suggestive association was found for GATA2 (rs2713604, p=0.057, OR=1.2). Among smokers, associations were found in CDGAP (rs10934490, p=0.019, OR=1.6) and KALRN (rs12637456, p=0.011, OR=2.0) and suggestive association in MYLK (rs16834871, p=0.051, OR=1.8, adjusting for gender). No SNP association was found among non-smokers, but smoking/SNP interactions were detected for CDGAP (rs10934491, p=0.017) and KALRN (rs12637456, p=0.010). Similar differences in SNP effects by smoking status were observed on re-analysis of CATHGEN. CAD associations were suggestive for GATA2 and among smokers significant post hoc associations were found in KALRN, MYLK, and CDGAP. Genetic risk conferred by some of these genes may be modified by smoking. Future CAD association studies of these and other genes should evaluate effect modification by smoking.
PMCID: PMC2764812  PMID: 19706030
coronary disease; genetic association; replication study; smoking
3.  Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests 
BMC Proceedings  2007;1(Suppl 1):S62.
Using the North American Rheumatoid Arthritis Consortium (NARAC) candidate gene and genome-wide single-nucleotide polymorphism (SNP) data sets, we applied regression methods and tree-based random forests to identify genetic associations with rheumatoid arthritis (RA) and to predict RA disease status. Several genes were consistently identified as weakly associated with RA without a significant interaction or combinatorial effect with other candidate genes. Using random forests, the tested candidate gene SNPs were not sufficient to predict RA patients and normal subjects with high accuracy. However, using the top 500 SNPs, ranked by the importance score, from the genome-wide linkage panel of 5742 SNPs, we were able to accurately predict RA patients and normal subjects with sensitivity of approximately 90% and specificity of approximately 80%, which was confirmed by five-fold cross-validation. However, in a complete training-testing framework, replication of genetic predictors was less satisfactory; thus, further evaluation of existing methodology and development of new methods are warranted.
PMCID: PMC2367463  PMID: 18466563
4.  Detecting significant single-nucleotide polymorphisms in a rheumatoid arthritis study using random forests 
BMC Proceedings  2009;3(Suppl 7):S69.
Random forest is an efficient approach for investigating not only the effects of individual markers on a trait but also the effect of the interactions among the markers in genetic association studies. This approach is especially appealing for the analysis of genome-wide data, such as those obtained from gene expression/single-nucleotide polymorphism (SNP) array experiments in which the number of candidate genes/SNPs is vast. We applied this approach to the Genetic Analysis Workshop 16 Problem 1 data to identify SNPs that contribute to rheumatoid arthritis. The random forest computed a raw importance score for each SNP marker, where a higher importance score suggests a higher level of association between the marker and the trait. The significance level of the association was determined empirically by repeatedly reapplying the random forest on randomly generated data under the null hypothesis that no association exists between the markers and the trait. Using random forest, we were able to identify 228 significant SNPs (at the genome-wide significance level of 0.05) across the whole genome, over two-thirds of which are located on chromosome 6, especially clustered in the region of 6p21 containing the human leukocyte antigen (HLA) genes, such as HLA-DRB1 and HLA-DRA. Further analysis of this region indicates a strong association with rheumatoid arthritis status.
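The empirical significance step can be illustrated with a short sketch. It assumes, as one common way to obtain a genome-wide level, that each null replicate is summarised by the largest importance score seen across all SNPs; the abstract does not state this detail, and the function name and the numbers below are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Empirical p-value of an observed importance score.

    null_scores: one score per null replicate (here assumed to be the maximum
                 importance over all SNPs in that replicate).
    The +1 terms keep the estimate away from an impossible p-value of zero.
    """
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


# Hypothetical values: 19 null replicates, none reaching the observed score.
null_max_importance = [1.0] * 19
p_value = empirical_p_value(2.0, null_max_importance)
significant = p_value <= 0.05
```

With 19 null replicates the smallest attainable p-value is 1/20, so more replicates are needed to resolve smaller p-values.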
PMCID: PMC2795970  PMID: 20018063
5.  SNPInterForest: A new method for detecting epistatic interactions 
BMC Bioinformatics  2011;12:469.
Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies.
We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for use in discovering single-nucleotide polymorphisms (SNPs), which are most predictive of the disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively exhibit information on interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome the above limitations by means of (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated on simulated data under a wide spectrum of disease models. SNPInterForest identified pure epistatic interactions with high precision and was still capable of concurrently identifying multiple interactions in the presence of genetic heterogeneity. It was also applied to real GWAS data of rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported.
SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.
PMCID: PMC3260223  PMID: 22151604
6.  Genotype-informed estimation of risk of coronary heart disease based on genome-wide association data linked to the electronic medical record 
Susceptibility variants identified by genome-wide association studies (GWAS) have modest effect sizes. Whether such variants provide incremental information in assessing risk for common 'complex' diseases is unclear. We investigated whether measured and imputed genotypes from a GWAS dataset linked to the electronic medical record alter estimates of coronary heart disease (CHD) risk.
Study participants (n = 1243) had no known cardiovascular disease and were considered to be at high, intermediate, or low 10-year risk of CHD based on the Framingham risk score (FRS) which includes age, sex, total and HDL cholesterol, blood pressure, diabetes, and smoking status. Of twelve SNPs identified in prior GWAS to be associated with CHD, four were genotyped in the participants as part of a GWAS. Genotypes for seven SNPs were imputed from HapMap CEU population using the program MACH. We calculated a multiplex genetic risk score for each patient based on the odds ratios of the susceptibility SNPs and incorporated this into the FRS.
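A genetic risk score of this kind can be sketched as follows. This is an illustration under stated assumptions and not the study's exact formula: each SNP is weighted by the natural log of its odds ratio, and the weights are rescaled so that they sum to the number of SNPs, which keeps the weighted score on the same scale as the plain risk-allele count. The function name and example values are hypothetical.

```python
import math


def genetic_risk_scores(risk_allele_counts, odds_ratios):
    """Return (unweighted, weighted) genetic risk scores for one person.

    risk_allele_counts: 0, 1 or 2 risk alleles at each susceptibility SNP
    odds_ratios:        per-allele odds ratio reported for each SNP (> 1)
    """
    weights = [math.log(odds_ratio) for odds_ratio in odds_ratios]
    # Rescale so the weights sum to the number of SNPs (assumed convention).
    scale = len(weights) / sum(weights)
    unweighted = sum(risk_allele_counts)
    weighted = sum(count * weight * scale
                   for count, weight in zip(risk_allele_counts, weights))
    return unweighted, weighted


# Hypothetical person typed at three SNPs with illustrative odds ratios.
count_score, weighted_score = genetic_risk_scores([2, 1, 0], [1.2, 1.3, 1.1])
```

When all odds ratios are equal, the weighted score reduces to the simple count of risk alleles.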
The mean (SD) number of risk alleles was 12.31 (1.95), range 6-18. The mean (SD) of the weighted genetic risk score was 12.64 (2.05), range 5.75-18.20. The CHD genetic risk score was not correlated with the FRS (P = 0.78). After incorporating the genetic risk score into the FRS, a total of 380 individuals (30.6%) were reclassified into higher-(188) or lower-risk groups (192).
A genetic risk score based on measured/imputed genotypes at 11 susceptibility SNPs led to significant reclassification in the 10-year CHD risk categories. Additional prospective studies are needed to assess the accuracy and clinical utility of such reclassification.
PMCID: PMC3269823  PMID: 22151179
7.  A Comparison of Logistic Regression, Logic Regression, Classification Tree, and Random Forests to Identify Effective Gene-Gene and Gene-Environmental Interactions 
Genome wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show potential evidence of applying multigenic approaches in association studies of various diseases including lung cancer. But the question remains as to the best methodology to analyze single nucleotide polymorphisms in multiple genes. In this work, we consider four methods—logistic regression, logic regression, classification tree, and random forests—to compare results for identifying important genes or gene-gene and gene-environmental interactions. To evaluate the performance of four methods, the cross-validation misclassification error and areas under the curves are provided. We performed a simulation study and applied them to the data from a large-scale, population-based, case-control study.
PMCID: PMC3686280  PMID: 23795347
SNP interactions; Logistic regression; Classification tree; Logic regression; Random Forests; Cross-validation error; Area under the Curve
8.  Fucosyltransferase 3 Polymorphism and Atherothrombotic Disease in the Framingham Offspring Study 
American heart journal  2007;153(4):636-639.
Previous studies have suggested a positive association between phenotypes of fucosyltransferase 3 (FUT3) gene (also known as Lewis gene) and coronary heart disease.
We used data on 1,735 unrelated subjects in the Framingham Offspring Study to assess whether 3 functional single nucleotide polymorphisms (SNPs) of the FUT3 gene (T59G, T1067A, and T202C) were associated with prevalent atherothrombotic disease.
Contrary to T1067A and T202C SNPs, there was evidence for an association between T59G SNP and atherothrombotic disease prevalence. In a multivariable model controlling for age, sex, alcohol intake, pack-years of smoking, ratio of total-to-HDL-cholesterol, and diabetes mellitus, odds ratios (95% CI) for prevalent atherothrombotic disease were 1.0 (reference), 0.80 (0.46-1.41), and 6.70 (1.95-23.01) for TT, TG, and GG genotypes of the T59G SNP, respectively. Minor alleles of T202C and T1067A SNPs showed a modest and non-significant association with atherothrombotic disease. Overall, FUT3 polymorphism that influences the enzyme activity (GG genotype for T59G or ≥ 1 minor allele of T202C or T1067A) was associated with increased atherothrombotic disease prevalence [OR: 1.57 (1.05-2.34)] and this association was stronger among abstainers (2-fold increased odds) than among current drinkers (p for interaction 0.11).
Our data suggest that functional mutations of the FUT3 gene may be associated with an increased atherothrombotic disease prevalence, especially among abstainers. Additional studies are warranted to confirm these findings.
PMCID: PMC1865525  PMID: 17383304
Cardiovascular disease; FUT3 gene; epidemiology; genetics
9.  Classification tree for detection of single-nucleotide polymorphism (SNP)-by-SNP interactions related to heart disease: Framingham Heart Study 
BMC Proceedings  2009;3(Suppl 7):S83.
The aim of this study was to detect the effect of interactions between single-nucleotide polymorphisms (SNPs) on incidence of heart diseases. For this purpose, 2912 subjects with 350,160 SNPs from the Framingham Heart Study (FHS) were analyzed. PLINK was used to control quality and to select the 10,000 most significant SNPs. A classification tree algorithm, Generalized, Unbiased, Interaction Detection and Estimation (GUIDE), was employed to build a classification tree to detect SNP-by-SNP interactions for the selected 10 k SNPs. The classes generated by GUIDE were reexamined by a generalized estimating equations (GEE) model with the empirical variance after accounting for potential familial correlation. Overall, 17 classes were generated based on the splitting criteria in GUIDE. The prevalence of coronary heart disease (CHD) in class 16 (determined by SNPs rs1894035, rs7955732, rs2212596, and rs1417507) was the lowest (0.23%). Compared to class 16, all other classes except for class 288 (prevalence of 1.2%) had a significantly greater risk when analyzed using GEE model. This suggests the interactions of SNPs on these node paths are significant.
PMCID: PMC2795986  PMID: 20018079
10.  Framingham Heart Study 100K project: genome-wide associations for cardiovascular disease outcomes 
BMC Medical Genetics  2007;8(Suppl 1):S5.
Cardiovascular disease (CVD) and its most common manifestations – including coronary heart disease (CHD), stroke, heart failure (HF), and atrial fibrillation (AF) – are major causes of morbidity and mortality. In many industrialized countries, CVD claims more lives each year than any other disease. Heart disease and stroke are the first and third leading causes of death in the United States. Prior investigations have reported several single gene variants associated with CHD, stroke, HF, and AF. We report a community-based genome-wide association study of major CVD outcomes.
In 1345 Framingham Heart Study participants from the largest 310 pedigrees (54% women, mean age 33 years at entry), we analyzed associations of 70,987 qualifying SNPs (Affymetrix 100K GeneChip) to four major CVD outcomes: major atherosclerotic CVD (n = 142; myocardial infarction, stroke, CHD death), major CHD (n = 118; myocardial infarction, CHD death), AF (n = 151), and HF (n = 73). Participants free of the condition at entry were included in proportional hazards models. We analyzed model-based deviance residuals using generalized estimating equations to test associations between SNP genotypes and traits in additive genetic models restricted to autosomal SNPs with minor allele frequency ≥0.10, genotype call rate ≥0.80, and Hardy-Weinberg equilibrium p-value ≥ 0.001.
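The SNP qualification thresholds quoted above (minor allele frequency ≥0.10, call rate ≥0.80, Hardy-Weinberg p ≥ 0.001) can be expressed as a small filter. This is a sketch only: the abstract does not say which Hardy-Weinberg test was used, so a 1-degree-of-freedom chi-square test is assumed here, and the function name and genotype-count inputs are hypothetical.

```python
import math


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_maf=0.10, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply MAF, call-rate and Hardy-Weinberg filters to one SNP's genotype counts."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)
    if maf == 0:
        return False  # monomorphic SNP: fails MAF and has no defined HWE test
    freq_b = 1 - freq_a
    expected = (n_called * freq_a ** 2,
                2 * n_called * freq_a * freq_b,
                n_called * freq_b ** 2)
    chi2 = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p


# Hypothetical SNP in exact Hardy-Weinberg proportions with no missing calls.
keep = passes_qc(250, 500, 250, 0)
```

A SNP with no heterozygotes at allele frequency 0.5, for example `passes_qc(500, 0, 500, 0)`, is rejected by the Hardy-Weinberg filter.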
Six associations yielded p < 10⁻⁵. The lowest p-values for each CVD trait were as follows: major CVD, rs499818, p = 6.6 × 10⁻⁶; major CHD, rs2549513, p = 9.7 × 10⁻⁶; AF, rs958546, p = 4.8 × 10⁻⁶; HF, rs740363, p = 8.8 × 10⁻⁶. Of note, we found associations of a 13 Kb region on chromosome 9p21 with major CVD (p 1.7 – 1.9 × 10⁻⁵) and major CHD (p 2.5 – 3.5 × 10⁻⁴) that confirm associations with CHD in two recently reported genome-wide association studies. Also, rs10501920 in CNTN5 was associated with AF (p = 9.4 × 10⁻⁶) and HF (p = 1.2 × 10⁻⁴). Complete results for these phenotypes can be found at the dbGaP website.
No association attained genome-wide significance, but several intriguing findings emerged. Notably, we replicated associations of chromosome 9p21 with major CVD. Additional studies are needed to validate these results. Finding genetic variants associated with CVD may point to novel disease pathways and identify potential targeted preventive therapies.
PMCID: PMC1995607  PMID: 17903304
11.  Screening large-scale association study data: exploiting interactions using random forests 
BMC Genetics  2004;5:32.
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
PMCID: PMC545646  PMID: 15588316
12.  Genetic variants of Complement factor H gene are not associated with premature coronary heart disease: a family-based study in the Irish population 
BMC Medical Genetics  2007;8:62.
The complement factor H (CFH) gene has been recently confirmed to play an essential role in the development of age-related macular degeneration (AMD). There are conflicting reports of its role in coronary heart disease. This study was designed to investigate if, using a family-based approach, there was an association between genetic variants of the CFH gene and risk of early-onset coronary heart disease.
We evaluated 6 SNPs and 5 common haplotypes in the CFH gene amongst 1494 individuals in 580 Irish families with at least one member prematurely affected with coronary heart disease. Genotypes were determined by multiplex SNaPshot technology.
Using the TDT/S-TDT test, we did not find an association between any of the individual SNPs or any of the 5 haplotypes and early-onset coronary heart disease.
In this family-based study, we found no association between the CFH gene and early-onset coronary heart disease.
PMCID: PMC2048938  PMID: 17877809
13.  Genome-Wide and Candidate Gene Association Study of Cigarette Smoking Behaviors 
PLoS ONE  2009;4(2):e4653.
The contribution of common genetic variation to one or more established smoking behaviors was investigated in a joint analysis of two genome wide association studies (GWAS) performed as part of the Cancer Genetic Markers of Susceptibility (CGEMS) project in 2,329 men from the Prostate, Lung, Colon and Ovarian (PLCO) Trial, and 2,282 women from the Nurses' Health Study (NHS). We analyzed seven measures of smoking behavior, four continuous (cigarettes per day [CPD], age at initiation of smoking, duration of smoking, and pack years), and three binary (ever versus never smoking, ≤10 versus >10 cigarettes per day [CPDBI], and current versus former smoking). Association testing for each single nucleotide polymorphism (SNP) was conducted by study and adjusted for age, cohabitation/marital status, education, site, and principal components of population substructure. None of the SNPs achieved genome-wide significance (p<10⁻⁷) in any combined analysis pooling evidence for association across the two studies; we observed between two and seven SNPs with p<10⁻⁵ for each of the seven measures. In the chr15q25.1 region spanning the nicotinic receptors CHRNA3 and CHRNA5, we identified multiple SNPs associated with CPD (p<10⁻³), including rs1051730, which has been associated with nicotine dependence, smoking intensity and lung cancer risk. In parallel, we selected 11,199 SNPs drawn from 359 a priori candidate genes and performed individual-gene and gene-group analyses. After adjusting for multiple tests conducted within each gene, we identified between two and five genes associated with each measure of smoking behavior. Besides CHRNA3 and CHRNA5, MAOA was associated with CPDBI (gene-level p<5.4×10⁻⁵). Our analysis provides independent replication of the association between the chr15q25.1 region and smoking intensity and data for multiple other loci associated with smoking behavior that merit further follow-up.
PMCID: PMC2644817  PMID: 19247474
14.  Secreted Modular Calcium-binding Protein 2 Haplotypes Are Associated with Pulmonary Function 
Rationale: Previously reported linkage to FEV1 (LOD score = 5.0) on 6q27 in the Framingham Heart Study (FHS) led us to explore a candidate gene, SMOC2, at 168.6 Mb.
Objectives: We tested association between SMOC2 polymorphisms and FEV1 and FVC in unrelated FHS participants.
Methods: Twenty single-nucleotide polymorphisms (SNPs) around SMOC2 were genotyped in 1,734 subjects.
Measurements and Main Results: SNP data were analyzed using multiple linear regression models incorporating sex, age, body mass index, height, and smoking history as covariates, and analyses were repeated within strata of ever- and never-smokers. The minor allele of SNP rs1402 was associated with higher mean FEV1 (p = 0.003) and FVC (p = 0.02) measures. In never-smoking subjects, association with higher measures was observed with the minor allele of rs747995 (FEV1, p = 0.0006; FVC, p = 0.0008). These two SNPs lie in different haplotype blocks and reside in intron 4 of SMOC2. Haplotype analysis revealed a common G-T haplotype (rs747995–rs1402) with 77% frequency in never-smoking FHS subjects. The G-T haplotype was associated with reduction of 126 ml for FEV1 (p = 0.0002) and 157 ml for FVC (p = 0.0002). The G-T haplotype was similarly associated in a set of never-smoking subjects from the Family Heart Study (FEV1, p = 0.03; FVC, p = 0.03).
Conclusions: The replication of the association in two populations supports the possibility that SMOC2 might play an important role in the determination of FEV1 and FVC.
PMCID: PMC1899283  PMID: 17204727
FEV1; FVC; genetics; single-nucleotide polymorphism
15.  Gene-Environment Interactions of Novel Variants Associated with Head and Neck Cancer 
Head & neck  2011;34(8):1111-1118.
A genome-wide association study for upper aerodigestive tract cancers identified 19 candidate single-nucleotide polymorphisms (SNPs). We used these SNPs to investigate the potential gene-gene and gene-environment interactions in head and neck squamous cell carcinoma (HNSCC) risk.
The 19 variants were genotyped using Taqman (Applied Biosystems) assays among 575 cases and 676 controls in our population-based case-control study.
A restricted cubic spline model suggested both ADH1B and HEL308 modified the association between smoking pack-years and HNSCC. Classification and regression tree analysis demonstrated a higher order interaction between smoking status, ADH1B, FLJ13089 and FLJ35784 in HNSCC risk. Compared with ever smokers carrying ADH1B T/C+T/T genotypes, smokers carrying ADH1B C/C genotype and FLJ13089 A/G+A/A genotypes had the highest risk of HNSCC (OR=1.84).
Our results suggest that the risk associated with these variants may be specifically important amongst specific exposure groups.
PMCID: PMC3662053  PMID: 22052802
post-genome wide association study; head and neck cancer; gene and environment interaction
16.  Application of Gene Network Analysis Techniques Identifies AXIN1/PDIA2 and Endoglin Haplotypes Associated with Bicuspid Aortic Valve 
PLoS ONE  2010;5(1):e8830.
Bicuspid Aortic Valve (BAV) is a highly heritable congenital heart defect. The low frequency of BAV (1% of general population) limits our ability to perform genome-wide association studies. We present the application of four a priori SNP selection techniques, reducing the multiple-testing penalty by restricting analysis to SNPs relevant to BAV in a genome-wide SNP dataset from a cohort of 68 BAV probands and 830 control subjects. Two knowledge-based approaches, CANDID and STRING, were used to systematically identify BAV genes, and their SNPs, from the published literature, microarray expression studies and a genome scan. We additionally tested Functionally Interpolating SNPs (fitSNPs) present on the array; the fourth consisted of SNPs selected by Random Forests, a machine learning approach. These approaches reduced the multiple testing penalty by lowering the fraction of the genome probed to 0.19% of the total, while increasing the likelihood of studying SNPs within relevant BAV genes and pathways. Three loci were identified by CANDID, STRING, and fitSNPs. A haplotype within the AXIN1-PDIA2 locus (p-value of 2.926×10⁻⁶) and a haplotype within the Endoglin gene (p-value of 5.881×10⁻⁴) were found to be strongly associated with BAV. The Random Forests approach identified a SNP on chromosome 3 in association with BAV (p-value 5.061×10⁻⁶). The results presented here support an important role for genetic variants in BAV and provide support for additional studies in well-powered cohorts. Further, these studies demonstrate that leveraging existing expression and genomic data in the context of GWAS studies can identify biologically relevant genes and pathways associated with a congenital heart defect.
PMCID: PMC2809109  PMID: 20098615
17.  A random forest approach to the detection of epistatic interactions in case-control studies 
BMC Bioinformatics  2009;10(Suppl 1):S65.
The key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases notwithstanding, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have shown their successes in small-scale case-control data, the "combination explosion" curse prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that are able to reduce the search space for epistatic interactions from an astronomical number of all possible combinations of genetic variants to a manageable set of candidates.
We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases against controls. On the basis of the gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that could minimize the classification error and then statistically tested up to three-way interactions of the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that were reported to be associated with this disease.
Besides existing pure statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The gini importance offers yet another measure for the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
PMCID: PMC2648748  PMID: 19208169
18.  Do changes in traditional coronary heart disease risk factors over time explain the association between socio-economic status and coronary heart disease? 
Socioeconomic status (SES) predicts coronary heart disease independently of the traditional risk factors included in the Framingham risk score. However, it is unknown whether changes in Framingham risk score variables over time explain the association between SES and coronary heart disease. We examined this question given its relevance to risk assessment in clinical decision making.
The Atherosclerosis Risk in Communities study data (initiated in 1987 with 10-years follow-up of 15,495 adults aged 45-64 years in four Southern and Mid-Western communities) were used. SES was assessed at baseline, dichotomized as low SES (defined as low education and/or low income) or not. The time dependent variables - smoking, total and high density lipoprotein cholesterol, systolic blood pressure and use of blood pressure lowering medication - were assessed every three years. Ten-year incidence of coronary heart disease was based on EKG and cardiac enzyme criteria, or adjudicated death certificate data. Cox survival analyses examined the contribution of SES to heart disease risk independent of baseline Framingham risk score, without and with further adjustment for the time dependent variables.
Adjusting for baseline Framingham risk score, low SES was associated with an increased coronary heart disease risk (hazard ratio [HR] = 1.53; 95% Confidence Interval [CI], 1.27 to 1.85). After further adjustment for the time dependent variables, the SES effect remained significant (HR = 1.44; 95% CI, 1.19 to 1.74).
Using Framingham Risk Score alone underestimated the coronary heart disease risk in low SES persons. This bias was not eliminated by subsequent changes in Framingham risk score variables.
PMCID: PMC3130693  PMID: 21639906
coronary disease; cholesterol; epidemiology; prevention; risk factors
19.  Maximal conditional chi-square importance in random forests 
Bioinformatics  2010;26(6):831-837.
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features such as single nucleotide polymorphisms (SNPs) in GWAS that are associated with a disease. Random forests represent a very useful approach for this purpose, using a variable importance score. This importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.
Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared to a univariate test and the permutation test using the Gini and permutation importance. In simulation, the proposed method performed consistently better than the other methods in identifying risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs by utilizing conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates the understanding of the etiology between genetic variants and complex diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2832825  PMID: 20130032
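The permutation step described in this abstract can be illustrated with a minimal sketch. It is not the MCC importance measure itself: a plain genotype-by-status Pearson chi-square stands in for whatever importance score is being tested, and the function names (`chi_square_stat`, `empirical_p_value`) and the toy data are assumptions made for illustration only.

```python
import random

def chi_square_stat(genotypes, labels):
    """Pearson chi-square for a genotype (0/1/2) by case-status (0/1) table."""
    n = len(genotypes)
    counts = {}
    for g, y in zip(genotypes, labels):
        counts[(g, y)] = counts.get((g, y), 0) + 1
    g_totals = {g: genotypes.count(g) for g in set(genotypes)}
    y_totals = {y: labels.count(y) for y in set(labels)}
    stat = 0.0
    for g, g_tot in g_totals.items():
        for y, y_tot in y_totals.items():
            expected = g_tot * y_tot / n
            observed = counts.get((g, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def empirical_p_value(genotypes, labels, n_perm=1000, seed=0):
    """Share of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = chi_square_stat(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-trait association
        if chi_square_stat(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 50 subjects, genotype strongly tied to case status.
genotypes = [0] * 20 + [1] * 10 + [2] * 20
labels = [0] * 18 + [1] * 2 + [0] * 5 + [1] * 5 + [0] * 2 + [1] * 18
p_value = empirical_p_value(genotypes, labels)
```

Shuffling the trait labels rather than the genotypes preserves the genotype distribution, so the permuted statistics approximate the null distribution for the same SNP.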
20.  SNP-SNP Interaction Network in Angiogenesis Genes Associated with Prostate Cancer Aggressiveness 
PLoS ONE  2013;8(4):e59688.
Angiogenesis has been shown to be associated with prostate cancer development. Most prostate cancer studies have focused on individual single nucleotide polymorphisms (SNPs), although SNP-SNP interactions are thought to have a great impact on unveiling the underlying mechanism of complex disease. Using 1,151 prostate cancer patients in the Cancer Genetic Markers of Susceptibility (CGEMS) dataset, 2,651 SNPs in the angiogenesis genes associated with prostate cancer aggressiveness were evaluated. SNP-SNP interactions were primarily assessed using the two-stage Random Forests plus Multivariate Adaptive Regression Splines (TRM) approach in the CGEMS group, and were then re-evaluated in the Moffitt group with 1,040 patients. For the identified gene pairs, cross-evaluation was applied to evaluate SNP interactions in both study groups. Five SNP-SNP interactions in three gene pairs (MMP16+ROBO1, MMP16+CSF1, and MMP16+EGFR) were identified to be associated with aggressive prostate cancer in both groups. Three pairs of SNPs (rs1477908+rs1387665, rs1467251+rs7625555, and rs1824717+rs7625555) were in MMP16 and ROBO1, one pair (rs2176771+rs333970) in MMP16 and CSF1, and one pair (rs1401862+rs6964705) in MMP16 and EGFR. The results suggest that MMP16 may play an important role in prostate cancer aggressiveness. By integrating our novel findings and available biomedical literature, a hypothetical gene interaction network was proposed. This network demonstrates that our identified SNP-SNP interactions are biologically relevant and shows that EGFR may be the hub for the interactions. The findings provide valuable information to identify genotype combinations at risk of developing aggressive prostate cancer and improve understanding of the genetic etiology of angiogenesis associated with prostate cancer aggressiveness.
PMCID: PMC3618555  PMID: 23593148
21.  Genome-wide association study for subclinical atherosclerosis in major arterial territories in the NHLBI's Framingham Heart Study 
BMC Medical Genetics  2007;8(Suppl 1):S4.
Subclinical atherosclerosis (SCA) measures in multiple arterial beds are heritable phenotypes that are associated with increased incidence of cardiovascular disease. We conducted a genome-wide association study (GWAS) for SCA measurements in the community-based Framingham Heart Study.
Over 100,000 single nucleotide polymorphisms (SNPs) were genotyped (Human 100K GeneChip, Affymetrix) in 1345 subjects from 310 families. We calculated sex-specific age-adjusted and multivariable-adjusted residuals in subjects tested for quantitative SCA phenotypes, including ankle-brachial index, coronary artery calcification and abdominal aortic calcification using multi-detector computed tomography, and carotid intimal medial thickness (IMT) using carotid ultrasonography. We evaluated associations of these phenotypes with 70,987 autosomal SNPs with minor allele frequency ≥ 0.10, call rate ≥ 80%, and Hardy-Weinberg p-value ≥ 0.001 in samples ranging from 673 to 984 subjects, using linear regression with generalized estimating equations (GEE) methodology and family-based association testing (FBAT). Variance components LOD scores were also calculated.
There was no association result meeting criteria for genome-wide significance, but our methods identified 11 SNPs with p < 10⁻⁵ by GEE and five SNPs with p < 10⁻⁵ by FBAT for multivariable-adjusted phenotypes. Among the associated variants were SNPs in or near genes that may be considered candidates for further study, such as rs1376877 (GEE p < 0.000001, located in ABI2) for maximum internal carotid artery IMT and rs4814615 (FBAT p = 0.000003, located in PCSK2) for maximum common carotid artery IMT. Modest significant associations were noted with various SCA phenotypes for variants in previously reported atherosclerosis candidate genes, including NOS3 and ESR1. Associations were also noted of a region on chromosome 9p21 with CAC phenotypes that confirm associations with coronary heart disease and CAC in two recently reported genome-wide association studies. In linkage analyses, several regions of genome-wide linkage were noted, confirming previously reported linkage of internal carotid artery IMT on chromosome 12. All GEE, FBAT and linkage results are provided as an open-access results resource.
The results from this GWAS generate hypotheses regarding several SNPs that may be associated with SCA phenotypes in multiple arterial beds. Given the number of tests conducted, subsequent independent replication in a staged approach is essential to identify genetic variants that may be implicated in atherosclerosis.
PMCID: PMC1995605  PMID: 17903303
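The SNP inclusion thresholds quoted in this abstract (minor allele frequency ≥ 0.10, call rate ≥ 80%, Hardy-Weinberg p ≥ 0.001) can be sketched as a simple filter on genotype counts. This is an illustration under stated assumptions, not the study's pipeline: the function name `snp_passes_qc` is hypothetical, and the Hardy-Weinberg check uses a 1-df Pearson chi-square on unrelated-sample counts, whereas the abstract does not say which test was applied to the family data.

```python
import math

def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  min_maf=0.10, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply MAF, call-rate, and Hardy-Weinberg filters to one SNP's counts."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    if maf == 0:
        return False  # monomorphic SNP; also avoids zero expected counts
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = [called * freq_a ** 2,
                called * 2 * freq_a * (1 - freq_a),
                called * (1 - freq_a) ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_p >= min_hwe_p)
```

For example, `snp_passes_qc(250, 500, 250, 0)` describes a SNP in exact Hardy-Weinberg proportions with full call rate, while `snp_passes_qc(500, 0, 500, 0)` has no heterozygotes and fails the Hardy-Weinberg filter.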
22.  A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses 
Lancet  2010;376(9750):1393-1400.
Comparison of patients with coronary heart disease and controls in genome-wide association studies has revealed several single nucleotide polymorphisms (SNPs) associated with coronary heart disease. We aimed to establish the external validity of these findings and to obtain more precise risk estimates using a prospective cohort design.
We tested 13 recently discovered SNPs for association with coronary heart disease in a case-control design including participants differing from those in the discovery samples (3829 participants with prevalent coronary heart disease and 48 897 controls free of the disease) and a prospective cohort design including 30 725 participants free of cardiovascular disease from Finland and Sweden. We modelled the 13 SNPs as a multilocus genetic risk score and used Cox proportional hazards models to estimate the association of genetic risk score with incident coronary heart disease. For case-control analyses we analysed associations between individual SNPs and quintiles of genetic risk score using logistic regression.
In prospective cohort analyses, 1264 participants had a first coronary heart disease event during a median 10·7 years' follow-up (IQR 6·7–13·6). Genetic risk score was associated with a first coronary heart disease event. When compared with the bottom quintile of genetic risk score, participants in the top quintile were at 1·66-times increased risk of coronary heart disease in a model adjusting for traditional risk factors (95% CI 1·35–2·04, p value for linear trend=7·3×10⁻¹⁰). Adjustment for family history did not change these estimates. Genetic risk score did not improve C index over traditional risk factors and family history (p=0·19), nor did it have a significant effect on net reclassification improvement (2·2%, p=0·18); however, it did have a small effect on integrated discrimination index (0·004, p=0·0006). Results of the case-control analyses were similar to those of the prospective cohort analyses.
Using a genetic risk score based on 13 SNPs associated with coronary heart disease, we can identify the 20% of individuals of European ancestry who are at roughly 70% increased risk of a first coronary heart disease event. The potential clinical use of this panel of SNPs remains to be defined.
Funding: The Wellcome Trust; Academy of Finland Center of Excellence for Complex Disease Genetics; US National Institutes of Health; the Donovan Family Foundation.
PMCID: PMC2965351  PMID: 20971364
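The multilocus score and quintile comparison described in this abstract can be sketched as follows. The sketch assumes an unweighted count of risk alleles per participant (the abstract does not state whether the score was weighted), and the function names and toy genotypes are hypothetical.

```python
def genetic_risk_score(risk_allele_counts):
    """Sum risk-allele counts (0, 1, or 2 per SNP) for one participant."""
    return sum(risk_allele_counts)

def assign_quintiles(scores):
    """Return a quintile per score (1 = lowest, 5 = highest), by rank.

    Ties are broken by input order because Python's sort is stable.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    quintiles = [0] * n
    for rank, idx in enumerate(order):
        quintiles[idx] = rank * 5 // n + 1
    return quintiles

# Hypothetical toy data: 5 participants, 4 SNPs each (the study used 13).
participants = [
    [0, 0, 1, 0],
    [2, 1, 1, 0],
    [1, 1, 0, 0],
    [2, 2, 2, 1],
    [1, 2, 1, 1],
]
scores = [genetic_risk_score(p) for p in participants]
quintiles = assign_quintiles(scores)
```

In an analysis like the one described, the quintile would then enter a survival or logistic model as a categorical predictor, with the bottom quintile as the reference group.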
23.  The Relationship Between Polymorphisms on Chromosome 9p21 and Age of Onset of Coronary Heart Disease in Black and White Women 
Aim: Genome-wide association studies have identified variants on chromosome 9p21 that are associated with coronary heart disease (CHD). The relationship between these variants and the age of onset of CHD is less clear. The aim of this study was to examine the allelic frequencies and haplotype structure of eight single-nucleotide polymorphisms (SNPs) on chromosome 9p21 in ethnically diverse women. We also explored the relationship between 9p21 SNPs and the age of CHD onset. Results: There was considerable interethnic allelic and haplotype diversity across the 9p21 locus with only two SNPs (rs10757274 and rs4977574) in perfect linkage disequilibrium in both races, and only a small proportion of the haplotypes shared between the racial groups. With the exception of rs1333040, whites with at least one copy of the 9p21 SNP risk alleles were found to have CHD from 1.45 (rs10116277) to 4.77 (rs2383206) years earlier than those with the wild-type alleles. Blacks carrying at least one copy of the risk allele (92%) for rs1333040 had a CHD age of onset that was 6.5 years earlier than those with the wild-type alleles. Conclusions: Different variants on chromosome 9p21 may influence CHD age of onset in whites and blacks.
PMCID: PMC3101922  PMID: 21375403
24.  A multilevel linear mixed model of the association between candidate genes and weight and body mass index using the Framingham longitudinal family data 
BMC Proceedings  2009;3(Suppl 7):S115.
Obesity has become an epidemic in many countries and is one of the major risk conditions for disease including type 2 diabetes, coronary heart disease, stroke, dyslipidemia, and hypertension. Recent genome-wide association studies have identified two genes (FTO and near MC4R) that were unequivocally associated with body mass index (BMI) and obesity. For the Genetic Analysis Workshop 16, data from the Framingham Heart Study were made available, including longitudinal anthropometric and metabolic traits for 7130 Caucasian individuals over three generations, each with follow-up data at up to four time points. We explored the associations between four single-nucleotide polymorphisms (SNPs) on FTO (rs1121980, rs9939609) or near MC4R (rs17782313, rs17700633) with weight and BMI under an additive model. We applied a multilevel linear mixed model for continuous outcomes, using the Affymetrix 500k genome-wide genotype data for the four SNPs. The results of the multilevel modeling in the entire sample indicated that the minor alleles of the four SNPs were associated with higher weight and higher BMI. The most significant associations were between rs9939609 and weight (p = 4.7 × 10⁻⁶) and BMI (p = 8.9 × 10⁻⁸). The results also showed that, for SNPs at FTO, the homozygotes for the minor allele had the most pronounced increase in weight and BMI, while the common allele homozygotes gained less weight and BMI during the follow-up period. Linkage disequilibrium (LD) between the two FTO SNPs was strong (D' = 0.997, r² = 0.875) but their haplotype was not significantly associated with either weight or BMI. The two SNPs near MC4R were in weak LD (D' = 0.487, r² = 0.166).
PMCID: PMC2795887  PMID: 20017980
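The two linkage disequilibrium measures reported in this abstract, D' and r², follow from the standard definitions D = p_AB − p_A·p_B, D' = |D| / D_max, and r² = D² / (p_A(1 − p_A)p_B(1 − p_B)). The sketch below computes both from a haplotype frequency and two allele frequencies; the function name `ld_stats` and the example frequencies are assumptions, not values from the study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D', r^2) given haplotype frequency p_ab and allele
    frequencies p_a and p_b at two biallelic loci."""
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2

# Hypothetical frequencies: allele B occurs only on haplotypes carrying A.
d_prime, r2 = ld_stats(p_ab=0.2, p_a=0.5, p_b=0.2)
```

With these example frequencies D' reaches its maximum while r² stays well below 1, which is the same qualitative pattern as a high D' paired with a lower r²: D' is scaled by the allele frequencies, whereas r² also penalizes frequency mismatch between the two loci.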
25.  Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project 
BMC Medical Genetics  2007;8(Suppl 1):S18.
Obesity is related to multiple cardiovascular disease (CVD) risk factors as well as CVD and has a strong familial component. We tested for association between SNPs on the Affymetrix 100K SNP GeneChip and measures of adiposity in the Framingham Heart Study.
A total of 1341 Framingham Heart Study participants in 310 families genotyped with the Affymetrix 100K SNP GeneChip had adiposity traits measured over 30 years of follow-up. Body mass index (BMI), waist circumference (WC), weight change, height, and radiographic measures of adiposity (subcutaneous adipose tissue, visceral adipose tissue, waist circumference, sagittal height) were measured at multiple examination cycles. Multivariable-adjusted residuals, adjusting for age, age-squared, sex, smoking, and menopausal status, were evaluated in association with the genotype data using additive Generalized Estimating Equations (GEE) and Family Based Association Test (FBAT) models. We prioritized mean BMI over offspring examinations (1–7) and cohort examinations (10, 16, 18, 20, 22, 24, 26) and mean WC over offspring examinations (4–7) for presentation. We evaluated associations with 70,987 SNPs on autosomes with minor allele frequencies of at least 0.10, Hardy-Weinberg equilibrium p ≥ 0.001, and call rates of at least 80%.
The top SNPs associated with mean BMI and mean WC by GEE were rs110683 (p-value 1.22 × 10⁻⁷) and rs4471028 (p-value 1.96 × 10⁻⁷). The complete set of results is available online. We were able to validate SNPs in known genes that have been related to BMI or other adiposity traits, including the ESR1 Xba1 SNP, PPARG, and ADIPOQ.
Adiposity traits are associated with SNPs on the Affymetrix 100K SNP GeneChip. Replication of these initial findings is necessary. These data will serve as a resource for replication as more genes become identified with BMI and WC.
PMCID: PMC1995618  PMID: 17903300

