Related Articles

1.  A Genome-Wide Association Study of Pulmonary Function Measures in the Framingham Heart Study 
PLoS Genetics  2009;5(3):e1000429.
The ratio of forced expiratory volume in one second to forced vital capacity (FEV1/FVC) is a measure used to diagnose airflow obstruction and is highly heritable. We performed a genome-wide association study in 7,691 Framingham Heart Study participants to identify single-nucleotide polymorphisms (SNPs) associated with the FEV1/FVC ratio, analyzed as a percent of the predicted value. Identified SNPs were examined in an independent set of 835 Family Heart Study participants enriched for airflow obstruction. Four SNPs in tight linkage disequilibrium on chromosome 4q31 were associated with the percent predicted FEV1/FVC ratio with p-values of genome-wide significance in the Framingham sample (best p-value = 3.6e-09). One of the four chromosome 4q31 SNPs (rs13147758; p-value 2.3e-08 in Framingham) was genotyped in the Family Heart Study and produced evidence of association with the same phenotype, percent predicted FEV1/FVC (p-value = 2.0e-04). The effect estimates for association in the Framingham and Family Heart studies were in the same direction, with the minor allele (G) associated with higher FEV1/FVC ratio levels. Results from the Family Heart Study demonstrated that the association extended to FEV1 and dichotomous airflow obstruction phenotypes, particularly among smokers. The SNP rs13147758 was associated with the percent predicted FEV1/FVC ratio in independent samples from the Framingham and Family Heart Studies producing a combined p-value of 8.3e-11, and this region of chromosome 4 around 145.68 megabases was associated with COPD in three additional populations reported in the accompanying manuscript. The associated SNPs do not lie within a gene transcript but are near the hedgehog-interacting protein (HHIP) gene and several expressed sequence tags cloned from fetal lung. Though it is unclear what gene or regulatory effect explains the association, the region warrants further investigation.
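The combined p-value above pools evidence from two independent samples. The abstract does not say which pooling method was used, so the sketch below assumes Fisher's method purely for illustration; its result should not be expected to reproduce the reported 8.3e-11. The function name `fisher_combined` is hypothetical.

```python
import math

def fisher_combined(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form used below.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    tail = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail

# p-values quoted in the abstract for rs13147758
framingham_p = 2.3e-08
family_heart_p = 2.0e-04

# Assumption: Fisher's method; the study's own pooling method is not stated.
combined = fisher_combined([framingham_p, family_heart_p])
print(f"Fisher combined p = {combined:.2e}")
```

Fisher's method ignores sample size and effect direction. A sample-size-weighted combination of z-scores is a common alternative when the two studies differ greatly in size, as they do here.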
Author Summary
Cigarette smoking is the primary risk factor for impaired lung function, yet only 20% of smokers develop chronic obstructive pulmonary disease (COPD). This observation, along with family studies of lung function and COPD, suggests that genetic factors influence susceptibility to cigarette smoke. We examined the relationship between common genetic variants and measures of lung function in a sample of 7,691 participants from the Framingham Heart Study and confirmed our observations in 835 participants from the Family Heart Study selected to include cases of airflow obstruction. We identified a variant on chromosome 4 that was strongly associated with FEV1/FVC in the Framingham Study and confirmed the association in the Family Heart Study. The accompanying manuscript identified the same region to be associated with COPD. Several interesting genes are present in the region that we identified, including a gene (HHIP) interacting with a biological pathway involved in lung development, but it is not yet clear which gene in the region explains the association. Our results identified a region of chromosome 4 that warrants further study to understand the genetic effects influencing lung function.
doi:10.1371/journal.pgen.1000429
PMCID: PMC2652834  PMID: 19300500
2.  Screening large-scale association study data: exploiting interactions using random forests 
BMC Genetics  2004;5:32.
Background
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
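The point about interaction effects with small marginal effects can be illustrated with a toy example. The sketch below uses hypothetical counts, not data from the study, in which disease risk depends on carrying exactly one of two variants. A univariate Fisher exact test on either SNP then sees a perfectly balanced table, while a test on the joint pattern is highly significant. The helper names are illustrative only, and the example does not implement a random forest.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)

# Hypothetical (cases, controls) by carrier status of (SNP A, SNP B).
counts = {
    (0, 0): (10, 40),
    (0, 1): (40, 10),
    (1, 0): (40, 10),
    (1, 1): (10, 40),
}

def table_for(group_of):
    """Collapse counts into [a, b, c, d]: rows are group 1 / group 0."""
    cells = {1: [0, 0], 0: [0, 0]}
    for genotype, (cases, controls) in counts.items():
        group = group_of(genotype)
        cells[group][0] += cases
        cells[group][1] += controls
    return cells[1] + cells[0]

p_snp_a = fisher_exact_two_sided(*table_for(lambda g: g[0]))
p_snp_b = fisher_exact_two_sided(*table_for(lambda g: g[1]))
p_joint = fisher_exact_two_sided(*table_for(lambda g: g[0] ^ g[1]))
print(p_snp_a, p_snp_b, p_joint)
```

A univariate screen ranked by `p_snp_a` and `p_snp_b` would discard both SNPs here, which is the failure mode that motivates an importance measure sensitive to interactions.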
Results
Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
Conclusions
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
doi:10.1186/1471-2156-5-32
PMCID: PMC545646  PMID: 15588316
3.  Reduced Glomerular Filtration Rate and Its Association with Clinical Outcome in Older Patients at Risk of Vascular Events: Secondary Analysis 
PLoS Medicine  2009;6(1):e1000016.
Background
Reduced glomerular filtration rate (GFR) is associated with increased cardiovascular risk in young and middle aged individuals. Associations with cardiovascular disease and mortality in older people are less clearly established. We aimed to determine the predictive value of the GFR for mortality and morbidity using data from the 5,804 participants randomized in the Prospective Study of Pravastatin in the Elderly at Risk (PROSPER).
Methods and Findings
Glomerular filtration rate was estimated (eGFR) using the Modification of Diet in Renal Disease equation and was categorized into the ranges 20–40, 40–50, 50–60, and ≥ 60 ml/min/1.73 m². Baseline risk factors were analysed by category of eGFR, with and without adjustment for other risk factors. The associations between baseline eGFR and morbidity and mortality outcomes, accrued after an average of 3.2 y, were investigated using Cox proportional hazard models adjusting for traditional risk factors. We tested for evidence of an interaction between the benefit of statin treatment and baseline eGFR status. Age, low-density lipoprotein (LDL) and high-density lipoprotein (HDL) cholesterol, C-reactive protein (CRP), body mass index, fasting glucose, female sex, histories of hypertension and vascular disease were associated with eGFR (p = 0.001 or less) after adjustment for other risk factors. Low eGFR was independently associated with risk of all cause mortality, vascular mortality, and other noncancer mortality and with fatal and nonfatal coronary and heart failure events (hazard ratios adjusted for CRP and other risk factors (95% confidence intervals [CIs]) for eGFR < 40 ml/min/1.73 m² relative to eGFR ≥ 60 ml/min/1.73 m², respectively 2.04 (1.48–2.80), 2.37 (1.53–3.67), 3.52 (1.78–6.96), 1.64 (1.18–2.27), 3.31 (2.03–5.41)). There were no nominally statistically significant interactions (p < 0.05) between randomized treatment allocation and eGFR for clinical outcomes, with the exception of the outcome of coronary heart disease death or nonfatal myocardial infarction (p = 0.021), with the interaction suggesting increased benefit of statin treatment in subjects with impaired GFRs.
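The estimation and banding step described above can be sketched as follows. This assumes the original four-variable MDRD equation (constant 186, serum creatinine in mg/dL); the abstract does not state which variant of the equation or which boundary conventions were used, so the half-open bands and the function names are assumptions.

```python
def mdrd_egfr(creatinine_mg_dl, age_years, female, black=False):
    """Estimate GFR in ml/min/1.73 m^2 with the four-variable MDRD equation.

    Assumption: original (non-IDMS) form with constant 186.
    """
    egfr = 186.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr

def egfr_category(egfr):
    """Assign an eGFR value to the bands named in the abstract.

    Assumption: each band includes its lower bound and excludes its upper.
    """
    if egfr >= 60:
        return ">=60"
    if egfr >= 50:
        return "50-60"
    if egfr >= 40:
        return "40-50"
    if egfr >= 20:
        return "20-40"
    return "<20"  # below the lowest band listed in the abstract

# Hypothetical participant: 75-year-old woman, creatinine 1.1 mg/dL.
value = mdrd_egfr(1.1, 75, female=True)
print(round(value, 1), egfr_category(value))
```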
Conclusions
We have established that, in an elderly population over the age of 70 y, impaired GFR is associated with female sex, with presence of vascular disease, and with levels of other risk factors that would be associated with increased risk of vascular disease. Further, impaired GFR is independently associated with significant levels of increased risk of all cause mortality and fatal vascular events and with composite fatal and nonfatal coronary and heart failure outcomes. Our analyses of the benefits of statin treatment in relation to baseline GFR suggest that there is no reason to exclude elderly patients with impaired renal function from treatment with a statin.
Using data from the PROSPER trial, Ian Ford and colleagues investigate whether reduced glomerular filtration rate is associated with cardiovascular and mortality risk among elderly people.
Editors' Summary
Background.
Cardiovascular disease (CVD)—disease that affects the heart and/or the blood vessels—is a common cause of death in developed countries. In the USA, for example, the single leading cause of death is coronary heart disease, a CVD in which narrowing of the heart's blood vessels slows or stops the blood supply to the heart and eventually causes a heart attack. Other types of CVD include stroke (in which narrowing of the blood vessels interrupts the brain's blood supply) and heart failure (a condition in which the heart can no longer pump enough blood to the rest of the body). Many factors increase the risk of developing CVD, including high blood pressure (hypertension), high blood cholesterol, having diabetes, smoking, and being overweight. Tools such as the “Framingham risk calculator” assess an individual's overall CVD risk by taking these and other risk factors into account. CVD risk can be minimized by taking drugs to reduce blood pressure or cholesterol levels (for example, pravastatin) and by making lifestyle changes.
Why Was This Study Done?
Another potential risk factor for CVD is impaired kidney (renal) function. In healthy people, the kidneys filter waste products and excess fluid out of the blood. A reduced “estimated glomerular filtration rate” (eGFR), which indicates impaired renal function, is associated with increased CVD in young and middle-aged people and increased all-cause and cardiovascular death in people who have vascular disease. But is reduced eGFR also associated with CVD and death in older people? If it is, it would be worth encouraging elderly people with reduced eGFR to avoid other CVD risk factors. In this study, the researchers determine the predictive value of eGFR for all-cause and vascular mortality (deaths caused by CVD) and for incident vascular events (a first heart attack, stroke, or heart failure) using data from the Prospective Study of Pravastatin in the Elderly at Risk (PROSPER). This clinical trial examined pravastatin's effects on CVD development among 70–82 year olds with pre-existing vascular disease or an increased risk of CVD because of smoking, hypertension, or diabetes.
What Did the Researchers Do and Find?
The trial participants were divided into four groups based on their eGFR at the start of the study. The researchers then investigated the association between baseline CVD risk factors and baseline eGFR and between baseline eGFR and vascular events and deaths that occurred during the 3-year study. Several established CVD risk factors were associated with a reduced eGFR after allowing for other risk factors. In addition, people with a low eGFR (between 20 and 40 units) were twice as likely to die from any cause as people with an eGFR above 60 units (the normal eGFR for a young person is 100 units; eGFR decreases with age) and more than three times as likely to have nonfatal coronary heart disease or heart failure. A low eGFR also increased the risk of vascular mortality, other noncancer deaths, and fatal coronary heart disease and heart failure. Finally, pravastatin treatment reduced coronary heart disease deaths and nonfatal heart attacks most effectively among participants with the greatest degree of eGFR impairment.
What Do These Findings Mean?
These findings suggest that, in elderly people, impaired renal function is associated with levels of established CVD risk factors that increase the risk of vascular disease. They also suggest that impaired kidney function increases the risk of all-cause mortality, fatal vascular events, and fatal and nonfatal coronary heart disease and heart failure. Because the study participants were carefully chosen for inclusion in PROSPER, these findings may not be generalizable to all elderly people with vascular disease or vascular disease risk factors. Nevertheless, increased efforts should probably be made to encourage elderly people with reduced eGFR and other vascular risk factors to make lifestyle changes to reduce their overall CVD risk. Finally, although the effect of statins in elderly patients with renal dysfunction needs to be examined further, these findings suggest that this group of patients should benefit at least as much from statins as elderly patients with healthy kidneys.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1000016.
The MedlinePlus Encyclopedia has pages on coronary heart disease, stroke, and heart failure (in English and Spanish)
MedlinePlus provides links to many other sources of information on heart disease, vascular disease, and stroke (in English and Spanish)
The US National Institute of Diabetes and Digestive and Kidney Diseases provides information on how the kidneys work and what can go wrong with them, including a list of links to further information about kidney disease
The American Heart Association provides information on all aspects of cardiovascular disease for patients, caregivers, and professionals (in several languages)
More information about PROSPER is available on the Web site of the Vascular Biochemistry Department of the University of Glasgow
doi:10.1371/journal.pmed.1000016
PMCID: PMC2628400  PMID: 19166266
4.  The Effect of Chromosome 9p21 Variants on Cardiovascular Disease May Be Modified by Dietary Intake: Evidence from a Case/Control and a Prospective Study 
PLoS Medicine  2011;8(10):e1001106.
Ron Do and colleagues find that a prudent diet high in raw vegetables may modify the increased genetic risk of cardiovascular disease conferred by the chromosome 9p21 SNP.
Background
One of the most robust genetic associations for cardiovascular disease (CVD) is the Chromosome 9p21 region. However, the interaction of this locus with environmental factors has not been extensively explored. We investigated the association of 9p21 with myocardial infarction (MI) in individuals of different ethnicities, and tested for an interaction with environmental factors.
Methods and Findings
We genotyped four 9p21 SNPs in 8,114 individuals from the global INTERHEART study. All four variants were associated with MI, with odds ratios (ORs) of 1.18 to 1.20 (1.85×10^−8 ≤ p ≤ 5.21×10^−7). A significant interaction (p = 4.0×10^−4) was observed between rs2383206 and a factor-analysis-derived “prudent” diet pattern score, for which a major component was raw vegetables. An effect of 9p21 on MI was observed in the group with a low prudent diet score (OR = 1.32, p = 6.82×10^−7), but the effect was diminished in a step-wise fashion in the medium (OR = 1.17, p = 4.9×10^−3) and high prudent diet scoring groups (OR = 1.02, p = 0.68) (p = 0.014 for difference). We also analyzed data from 19,129 individuals (including 1,014 incident cases of CVD) from the prospective FINRISK study, which used a closely related dietary variable. In this analysis, the 9p21 risk allele demonstrated a larger effect on CVD risk in the groups with diets low or average for fresh vegetables, fruits, and berries (hazard ratio [HR] = 1.22, p = 3.0×10^−4, and HR = 1.35, p = 4.1×10^−3, respectively) compared to the group with high consumption of these foods (HR = 0.96, p = 0.73) (p = 0.0011 for difference). The combination of the least prudent diet and two copies of the risk allele was associated with a 2-fold increase in risk for MI (OR = 1.98, p = 2.11×10^−9) in the INTERHEART study and a 1.66-fold increase in risk for CVD in the FINRISK study (HR = 1.66, p = 0.0026).
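The abstract reports ratios with p-values but no confidence intervals. As a rough guide to the precision of such estimates, an approximate interval can be backed out from a ratio and its p-value. The sketch below assumes the p-value comes from a two-sided Wald test on the log ratio, which the abstract does not state, so the output is an approximation and not a figure from the study. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def ci_from_ratio_and_p(ratio, p_value, level=0.95):
    """Approximate confidence interval for an odds or hazard ratio.

    Assumption: p_value is two-sided and derived from a Wald z-test
    on log(ratio), so that z = |log(ratio)| / SE.
    """
    std_normal = NormalDist()
    z = -std_normal.inv_cdf(p_value / 2)       # z-score matching the p-value
    log_ratio = math.log(ratio)
    se = abs(log_ratio) / z                    # implied standard error
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    return (math.exp(log_ratio - z_crit * se),
            math.exp(log_ratio + z_crit * se))

# OR and p-value quoted for the least prudent diet plus two risk alleles.
lower, upper = ci_from_ratio_and_p(1.98, 2.11e-9)
print(f"approximate 95% CI: {lower:.2f} to {upper:.2f}")
```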
Conclusions
The risk of MI and CVD conferred by Chromosome 9p21 SNPs appears to be modified by a prudent diet high in raw vegetables and fruits.
Editors' Summary
Background
Cardiovascular diseases (CVDs)—diseases that affect the heart and/or the blood vessels—are a leading cause of illness and death worldwide. In the United States, for example, the leading cause of death is coronary heart disease, a CVD in which narrowing of the heart's blood vessels by fatty deposits slows the blood supply to the heart and may eventually cause a heart attack (myocardial infarction, or MI); the third leading cause of death in the US is stroke, a CVD in which the brain's blood supply is interrupted. Environmental factors such as diet, physical activity, and smoking alter a person's risk of developing CVD. In addition, certain genetic variants (alterations in the DNA that forms the body's blueprint; DNA is packed into structures called chromosomes) alter the risk of developing CVD and are passed from parent to child. Thus, in CVD, as in most common diseases, both genetics and the environment play a role.
Why Was This Study Done?
Recent studies have identified several genetic variants that are associated with an increased risk of developing CVD. One of the most robust of these genetic associations is a cluster of single nucleotide polymorphisms (SNPs, differences in a single DNA building block) in a chromosomal region (locus) called 9p21. So far, this association has been mainly studied in European populations. Moreover, the interaction of this locus with environmental factors has not been extensively studied. A better understanding of how 9p21 variants affect CVD risk in people of different ethnicities and of the interaction between this locus and environmental factors could allow the development of targeted strategies for the prevention of CVD. In this study, the researchers investigate the association of 9p21 risk variants with CVD in people of different ethnicities and test for an interaction between this locus and environmental factors.
What Did the Researchers Do and Find?
The researchers assessed four 9p21 SNPs in people enrolled in the INTERHEART study, a global retrospective case-control study that investigated potential MI risk factors by comparing people who had had an acute non-fatal MI with similar people without heart disease. All four SNP risk variants increased the risk of MI by about a fifth. However, the effect of the SNPs on MI was influenced by the “prudent” diet pattern score of the INTERHEART participants, a score that includes fresh fruit and vegetable intake as recorded in food frequency questionnaires. That is, the risk of MI in people carrying SNP risk variants was influenced by their diet. The strongest interaction was seen with an SNP called rs2383206, but although rs2383206 carriers who ate a diet poor in fruits and vegetables had a higher risk of MI than people with a similar diet who did not carry this SNP, rs2383206 carriers and non-carriers who ate a fruit- and vegetable-rich diet had a comparable MI risk. Overall, the combination of the least “prudent” diet and two copies of the risk variant (human cells contain two complete sets of chromosomes) was associated with a two-fold increase in risk for MI in the INTERHEART study. Additionally, data collected in the FINRISK study, which characterized healthy individuals living in Finland at baseline and then followed them to see whether they developed CVD, revealed a similar interaction between diet and 9p21 SNPs.
What Do These Findings Mean?
These findings suggest that the risk of CVD conferred by chromosome 9p21 SNPs may be influenced by diet in multiple ethnic groups. Importantly, they suggest that the deleterious effect of 9p21 SNPs on CVD might be mitigated by consuming a diet rich in fresh fruits and vegetables. The accuracy of these findings may be affected by recall bias in the INTERHEART study (that is, some people may not have remembered their diet accurately) and by the small number of CVD cases in the FINRISK study. Nevertheless, these findings suggest that gene–environment interactions are important drivers of CVD, and they raise the possibility that a sound diet can mediate the effects of 9p21 SNPs.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001106.
The American Heart Association provides information about many types of cardiovascular disease for patients, caregivers, and professionals and tips on keeping the heart healthy
The UK National Health Service Choices website provides information about cardiovascular disease and stroke
Information is available from the British Heart Foundation on heart disease and keeping the heart healthy
The US National Heart Lung and Blood Institute provides information on a wide range of cardiovascular diseases
MedlinePlus provides links to many other sources of information on heart diseases, vascular diseases, and stroke (in English and Spanish)
The US Centers for Disease Control and Prevention has a simple fact sheet on gene-environment interactions; the US National Institute of Environmental Health Sciences provides links to other information on gene-environment interactions
More information is available on the INTERHEART study and on the FINRISK study
doi:10.1371/journal.pmed.1001106
PMCID: PMC3191151  PMID: 22022235
5.  A Genome-Wide Association Study in Chronic Obstructive Pulmonary Disease (COPD): Identification of Two Major Susceptibility Loci 
PLoS Genetics  2009;5(3):e1000421.
There is considerable variability in the susceptibility of smokers to develop chronic obstructive pulmonary disease (COPD). The only known genetic risk factor is severe deficiency of α1-antitrypsin, which is present in 1–2% of individuals with COPD. We conducted a genome-wide association study (GWAS) in a homogenous case-control cohort from Bergen, Norway (823 COPD cases and 810 smoking controls) and evaluated the top 100 single nucleotide polymorphisms (SNPs) in the family-based International COPD Genetics Network (ICGN; 1891 Caucasian individuals from 606 pedigrees) study. The polymorphisms that showed replication were further evaluated in 389 subjects from the US National Emphysema Treatment Trial (NETT) and 472 controls from the Normative Aging Study (NAS) and then in a fourth cohort of 949 individuals from 127 extended pedigrees from the Boston Early-Onset COPD population. Logistic regression models with adjustments of covariates were used to analyze the case-control populations. Family-based association analyses were conducted for a diagnosis of COPD and lung function in the family populations. Two SNPs at the α-nicotinic acetylcholine receptor (CHRNA3/5) locus were identified in the genome-wide association study. They showed unambiguous replication in the ICGN family-based analysis and in the NETT case-control analysis with combined p-values of 1.48×10^−10 (rs8034191) and 5.74×10^−10 (rs1051730). Furthermore, these SNPs were significantly associated with lung function in both the ICGN and Boston Early-Onset COPD populations. The C allele of the rs8034191 SNP was estimated to have a population attributable risk for COPD of 12.2%. The association of hedgehog interacting protein (HHIP) locus on chromosome 4 was also consistently replicated, but did not reach genome-wide significance levels. Genome-wide significant association of the HHIP locus with lung function was identified in the Framingham Heart study (Wilk et al., companion article in this issue of PLoS Genetics; doi:10.1371/journal.pgen.1000429). The CHRNA3/5 and the HHIP loci make a significant contribution to the risk of COPD. CHRNA3/5 is the same locus that has been implicated in the risk of lung cancer.
Author Summary
There is considerable variability in the susceptibility of smokers to develop chronic obstructive pulmonary disease (COPD), which is a heritable multi-factorial trait. Identifying the genetic determinants of COPD risk will have tremendous public health importance. This study describes the first genome-wide association study (GWAS) in COPD. We conducted a GWAS in a homogenous case-control cohort from Norway and evaluated the top 100 single nucleotide polymorphisms in the family-based International COPD Genetics Network. The polymorphisms that showed replication were further evaluated in subjects from the US National Emphysema Treatment Trial and controls from the Normative Aging Study and then in a fourth cohort of extended pedigrees from the Boston Early-Onset COPD population. Two polymorphisms in the α-nicotinic acetylcholine receptor 3/5 locus on chromosome 15 showed unambiguous evidence of association with COPD. This locus has previously been implicated in both smoking behavior and risk of lung cancer, suggesting the possibility of multiple functional polymorphisms in the region or a single polymorphism with wide phenotypic consequences. The hedgehog interacting protein (HHIP) locus on chromosome 4, which is associated with COPD, is also a significant risk locus for COPD.
doi:10.1371/journal.pgen.1000421
PMCID: PMC2650282  PMID: 19300482
6.  Data mining of high density genomic variant data for prediction of Alzheimer's disease risk 
BMC Medical Genetics  2012;13:7.
Background
Discovering genetic associations is important for understanding human illness and deriving disease pathways. Identifying multiple interacting genetic mutations associated with disease remains challenging in studying the etiology of complex diseases. Although genome-wide association studies (GWAS) have recently confirmed new single nucleotide polymorphisms (SNPs) in genes implicated in immune response, cholesterol/lipid metabolism, and cell membrane processes as associated with late-onset Alzheimer's disease (LOAD), part of AD heritability remains unexplained. We use data mining methods to search for other genetic variants that may influence LOAD risk.
Methods
Two different approaches were devised to select SNPs associated with LOAD in a publicly available GWAS data set consisting of three cohorts. In both approaches, single-locus analysis (logistic regression) was conducted to filter the data with a less conservative p-value than the Bonferroni threshold; this resulted in a subset of SNPs used next in multi-locus analysis (random forest (RF)). In the second approach, we took into account prior biological knowledge, and performed sample stratification and linkage disequilibrium (LD) in addition to logistic regression analysis to preselect loci to input into the RF classifier construction step.
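The first filtering stage described above can be sketched as a simple threshold on single-locus p-values. All SNP names, p-values, the number of tests, and the relaxed cutoff below are hypothetical placeholders; the abstract does not give the thresholds that were used, and the random forest stage is not shown.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error rate."""
    return alpha / n_tests

def prefilter(p_values, threshold):
    """Return SNP ids with p below the threshold, most significant first."""
    passing = (snp for snp, p in p_values.items() if p < threshold)
    return sorted(passing, key=p_values.get)

# Hypothetical single-locus (logistic regression) p-values.
single_locus_p = {"snp_a": 3e-9, "snp_b": 4e-6, "snp_c": 0.02, "snp_d": 8e-5}

strict = bonferroni_threshold(0.05, 500_000)   # hypothetical number of tests
relaxed = 1e-4                                 # hypothetical relaxed cutoff

print(prefilter(single_locus_p, strict))
print(prefilter(single_locus_p, relaxed))
```

The relaxed cutoff lets through SNPs with modest marginal signal so that the multi-locus stage has a chance to use them.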
Results
The first approach gave 199 SNPs mostly associated with genes in calcium signaling, cell adhesion, endocytosis, immune response, and synaptic function. These SNPs together with APOE and GAB2 SNPs formed a predictive subset for LOAD status with an average error of 9.8% using 10-fold cross validation (CV) in RF modeling. Nineteen variants in LD with ST5, TRPC1, ATG10, ANO3, NDUFA12, and NISCH respectively, genes linked directly or indirectly with neurobiology, were identified with the second approach. These variants were part of a model that included APOE and GAB2 SNPs to predict LOAD risk which produced a 10-fold CV average error of 17.5% in the classification modeling.
Conclusions
With the two proposed approaches, we identified a large subset of SNPs in genes mostly clustered around specific pathways/functions and a smaller set of SNPs, within or in proximity to five genes not previously reported, that may be relevant for the prediction/understanding of AD.
doi:10.1186/1471-2350-13-7
PMCID: PMC3355044  PMID: 22273362
Late-Onset Alzheimer's Disease; GWAS; SNPs; Random Forest
7.  Performance of random forest when SNPs are in linkage disequilibrium 
BMC Bioinformatics  2009;10:78.
Background
Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF.
Results
We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
Conclusion
Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.
doi:10.1186/1471-2105-10-78
PMCID: PMC2666661  PMID: 19265542
8.  Validation Study of Genetic Associations with Coronary Artery Disease on Chromosome 3q13-21 and Potential Effect Modification by Smoking 
Annals of human genetics  2009;73(Pt 6):551-558.
Summary
The CATHGEN study reported associations of chromosome 3q13-21 genes (KALRN, MYLK, CDGAP, and GATA2) with early-onset coronary artery disease (CAD). This study attempted to independently validate those associations. Eleven single nucleotide polymorphisms (SNPs) were examined (rs10934490, rs16834817, rs6810298, rs9289231, rs12637456, rs1444768, rs1444754, rs4234218, rs2335052, rs3803, rs2713604) in patients (N=1,618) from the Intermountain Heart Collaborative Study (IHCS). Given the higher smoking prevalence in CATHGEN than IHCS (41% vs 11% in controls, 74% vs 29% in cases), smoking stratification and genotype-smoking interactions were evaluated. Suggestive association was found for GATA2 (rs2713604, p=0.057, OR=1.2). Among smokers, associations were found in CDGAP (rs10934490, p=0.019, OR=1.6) and KALRN (rs12637456, p=0.011, OR=2.0) and suggestive association in MYLK (rs16834871, p=0.051, OR=1.8, adjusting for gender). No SNP association was found among non-smokers, but smoking/SNP interactions were detected for CDGAP (rs10934491, p=0.017) and KALRN (rs12637456, p=0.010). Similar differences in SNP effects by smoking status were observed on re-analysis of CATHGEN. CAD associations were suggestive for GATA2 and among smokers significant post hoc associations were found in KALRN, MYLK, and CDGAP. Genetic risk conferred by some of these genes may be modified by smoking. Future CAD association studies of these and other genes should evaluate effect modification by smoking.
doi:10.1111/j.1469-1809.2009.00540.x
PMCID: PMC2764812  PMID: 19706030
coronary disease; genetic association; replication study; smoking
9.  Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data 
Background
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
Methods
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
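The filtration step described above (a Bonferroni threshold followed by LD pruning) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the greedy pruning order, the r² cutoff of 0.8, and the toy genotypes are all assumptions made for illustration.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def filter_snps(pvalues, genotypes, alpha=0.05, r2_max=0.8):
    """Keep SNPs passing a Bonferroni threshold, then greedily prune LD.

    pvalues:   dict of SNP name -> single-marker p-value
    genotypes: dict of SNP name -> list of genotypes, one per individual
    The r2_max cutoff is an assumed value, not one taken from the abstract.
    """
    threshold = alpha / len(pvalues)  # Bonferroni correction
    passing = sorted((s for s in pvalues if pvalues[s] <= threshold),
                     key=pvalues.get)
    kept = []
    for snp in passing:
        # Keep a SNP only if it is not in strong LD with a better-ranked one.
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_max for k in kept):
            kept.append(snp)
    return kept

# Hypothetical toy data: snpB duplicates snpA (perfect LD), snpD is not significant.
toy_p = {"snpA": 1e-6, "snpB": 2e-6, "snpC": 0.004, "snpD": 0.5}
toy_g = {
    "snpA": [0, 1, 2, 0, 1, 2, 0, 1],
    "snpB": [0, 1, 2, 0, 1, 2, 0, 1],
    "snpC": [2, 0, 1, 1, 0, 2, 1, 0],
    "snpD": [1, 1, 0, 2, 0, 1, 2, 0],
}
print(filter_snps(toy_p, toy_g))
```

Comparing error rates across several `alpha` values, as the abstract describes, would amount to calling `filter_snps` once per threshold and evaluating a classifier on each resulting SNP set.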
Results
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration do not differ significantly from those of randomly selected SNP sets or of T2D causal SNP combinations from optimal filtration.
Conclusions
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
doi:10.1186/1472-6947-13-S1-S3
PMCID: PMC3618247  PMID: 23566118
10.  SNPInterForest: A new method for detecting epistatic interactions 
BMC Bioinformatics  2011;12:469.
Background
Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies.
Results
We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for use in discovering single-nucleotide polymorphisms (SNPs), which are most predictive of the disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively exhibit information on interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome the above limitations by means of (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated on simulated data under a wide spectrum of disease models. SNPInterForest identified pure epistatic interactions with high precision and was still capable of concurrently identifying multiple interactions in the presence of genetic heterogeneity. The method was also applied to real GWAS data of rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported.
Conclusions
SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.
doi:10.1186/1471-2105-12-469
PMCID: PMC3260223  PMID: 22151604
11.  Detecting significant single-nucleotide polymorphisms in a rheumatoid arthritis study using random forests 
BMC Proceedings  2009;3(Suppl 7):S69.
Random forest is an efficient approach for investigating not only the effects of individual markers on a trait but also the effect of the interactions among the markers in genetic association studies. This approach is especially appealing for the analysis of genome-wide data, such as those obtained from gene expression/single-nucleotide polymorphism (SNP) array experiments in which the number of candidate genes/SNPs is vast. We applied this approach to the Genetic Analysis Workshop 16 Problem 1 data to identify SNPs that contribute to rheumatoid arthritis. The random forest computed a raw importance score for each SNP marker, where higher importance score suggests higher level of association between the marker and the trait. The significance level of the association was determined empirically by repeatedly reapplying the random forest on randomly generated data under the null hypothesis that no association exists between the markers and the trait. Using random forest, we were able to identify 228 significant SNPs (at the genome-wide significant level of 0.05) across the whole genome, over two-thirds of which are located on chromosome 6, especially clustered in the region of 6p21 containing the human leukocyte antigen (HLA) genes, such as gene HLA-DRB1 and HLA-DRA. Further analysis of this region indicates a strong association to the rheumatoid arthritis status.
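The empirical significance procedure described above can be sketched as a label-permutation test. To keep the example dependency-free, the random forest importance is swapped for a simple stand-in score (the absolute difference in mean genotype between cases and controls), and the null data are generated by shuffling case/control labels; both choices, along with all names below, are assumptions and not the authors' implementation.

```python
import random

def score(genotypes, labels):
    """Stand-in importance: |mean genotype in cases - mean genotype in controls|."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def empirical_pvalues(snps, labels, n_perm=1000, seed=0):
    """Family-wise empirical p-values from the null distribution of the maximum score.

    snps:   dict of SNP name -> list of genotypes (0/1/2), one per individual
    labels: list of 1 (case) / 0 (control)
    """
    rng = random.Random(seed)
    observed = {name: score(g, labels) for name, g in snps.items()}
    shuffled = list(labels)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait association
        null_max.append(max(score(g, shuffled) for g in snps.values()))
    # The +1 terms count the observed data as one draw from the null.
    return {name: (1 + sum(m >= s for m in null_max)) / (1 + n_perm)
            for name, s in observed.items()}

# Hypothetical toy data: one perfectly associated SNP and one unrelated SNP.
toy_labels = [1] * 10 + [0] * 10
toy_snps = {"assoc": [2] * 10 + [0] * 10, "null": [0, 1] * 10}
toy_result = empirical_pvalues(toy_snps, toy_labels, n_perm=200)
print(toy_result)
```

Comparing each observed score with the permutation distribution of the maximum score across all SNPs controls the family-wise error rate, which is one common reading of a "genome-wide" significance level of 0.05.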
PMCID: PMC2795970  PMID: 20018063
13.  Use of Wrapper Algorithms Coupled with a Random Forests Classifier for Variable Selection in Large-Scale Genomic Association Studies 
Journal of Computational Biology  2009;16(12):1705-1718.
Abstract
Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
doi:10.1089/cmb.2008.0037
PMCID: PMC2980837  PMID: 20047492
coronary heart disease; genome-wide association studies; Random Forests classifier; SNPs; variable selection
14.  Genome-Wide Joint Meta-Analysis of SNP and SNP-by-Smoking Interaction Identifies Novel Loci for Pulmonary Function 
Hancock, Dana B. | Artigas, María Soler | Gharib, Sina A. | Henry, Amanda | Manichaikul, Ani | Ramasamy, Adaikalavan | Loth, Daan W. | Imboden, Medea | Koch, Beate | McArdle, Wendy L. | Smith, Albert V. | Smolonska, Joanna | Sood, Akshay | Tang, Wenbo | Wilk, Jemma B. | Zhai, Guangju | Zhao, Jing Hua | Aschard, Hugues | Burkart, Kristin M. | Curjuric, Ivan | Eijgelsheim, Mark | Elliott, Paul | Gu, Xiangjun | Harris, Tamara B. | Janson, Christer | Homuth, Georg | Hysi, Pirro G. | Liu, Jason Z. | Loehr, Laura R. | Lohman, Kurt | Loos, Ruth J. F. | Manning, Alisa K. | Marciante, Kristin D. | Obeidat, Ma'en | Postma, Dirkje S. | Aldrich, Melinda C. | Brusselle, Guy G. | Chen, Ting-hsu | Eiriksdottir, Gudny | Franceschini, Nora | Heinrich, Joachim | Rotter, Jerome I. | Wijmenga, Cisca | Williams, O. Dale | Bentley, Amy R. | Hofman, Albert | Laurie, Cathy C. | Lumley, Thomas | Morrison, Alanna C. | Joubert, Bonnie R. | Rivadeneira, Fernando | Couper, David J. | Kritchevsky, Stephen B. | Liu, Yongmei | Wjst, Matthias | Wain, Louise V. | Vonk, Judith M. | Uitterlinden, André G. | Rochat, Thierry | Rich, Stephen S. | Psaty, Bruce M. | O'Connor, George T. | North, Kari E. | Mirel, Daniel B. | Meibohm, Bernd | Launer, Lenore J. | Khaw, Kay-Tee | Hartikainen, Anna-Liisa | Hammond, Christopher J. | Gläser, Sven | Marchini, Jonathan | Kraft, Peter | Wareham, Nicholas J. | Völzke, Henry | Stricker, Bruno H. C. | Spector, Timothy D. | Probst-Hensch, Nicole M. | Jarvis, Deborah | Jarvelin, Marjo-Riitta | Heckbert, Susan R. | Gudnason, Vilmundur | Boezen, H. Marike | Barr, R. Graham | Cassano, Patricia A. | Strachan, David P. | Fornage, Myriam | Hall, Ian P. | Dupuis, Josée | Tobin, Martin D. | London, Stephanie J.
PLoS Genetics  2012;8(12):e1003098.
Genome-wide association studies have identified numerous genetic loci for spirometric measures of pulmonary function, forced expiratory volume in one second (FEV1), and its ratio to forced vital capacity (FEV1/FVC). Given that cigarette smoking adversely affects pulmonary function, we conducted genome-wide joint meta-analyses (JMA) of single nucleotide polymorphism (SNP) and SNP-by-smoking (ever-smoking or pack-years) associations on FEV1 and FEV1/FVC across 19 studies (total N = 50,047). We identified three novel loci not previously associated with pulmonary function. SNPs in or near DNER (smallest P_JMA = 5.00×10^−11), HLA-DQB1 and HLA-DQA2 (smallest P_JMA = 4.35×10^−9), and KCNJ2 and SOX9 (smallest P_JMA = 1.28×10^−8) were associated with FEV1/FVC or FEV1 in meta-analysis models including SNP main effects, smoking main effects, and SNP-by-smoking (ever-smoking or pack-years) interaction. The HLA region has been widely implicated for autoimmune and lung phenotypes, unlike the other novel loci, which have not been widely implicated. We evaluated DNER, KCNJ2, and SOX9 and found them to be expressed in human lung tissue. DNER and SOX9 further showed evidence of differential expression in human airway epithelium in smokers compared to non-smokers. Our findings demonstrated that joint testing of SNP and SNP-by-environment interaction identified novel loci associated with complex traits that are missed when considering only the genetic main effects.
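Joint testing of a SNP main effect together with its SNP-by-smoking interaction is commonly done with a 2-degree-of-freedom Wald test on the two regression coefficients. The sketch below shows only that test step for a single set of estimates, assuming the coefficients and their covariance are already available; it does not reproduce the cross-study meta-analysis, and the function name and inputs are hypothetical.

```python
import math

def joint_wald_2df(beta_snp, beta_int, var_snp, var_int, cov):
    """2-df Wald test of H0: beta_snp = 0 and beta_int = 0.

    beta_snp, beta_int: estimated SNP and SNP-by-smoking coefficients
    var_snp, var_int:   their variances; cov is their covariance
    Returns (chi-square statistic, p-value).
    """
    det = var_snp * var_int - cov * cov
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form b' V^{-1} b, written out for a 2x2 covariance matrix.
    chi2 = (beta_snp ** 2 * var_int
            - 2 * beta_snp * beta_int * cov
            + beta_int ** 2 * var_snp) / det
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    return chi2, math.exp(-chi2 / 2)

# Hypothetical estimates: neither effect alone is striking, but jointly they are.
chi2, p = joint_wald_2df(beta_snp=0.020, beta_int=0.015,
                         var_snp=1e-4, var_int=1e-4, cov=-5e-5)
print(round(chi2, 2), p)
```

A joint test of this kind can detect loci where the main effect and the interaction each contribute modestly, which is the situation the abstract says main-effect-only scans miss.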
Author Summary
Measures of pulmonary function provide important clinical tools for evaluating lung disease and its progression. Genome-wide association studies have identified numerous genetic risk factors for pulmonary function but have not considered interaction with cigarette smoking, which has consistently been shown to adversely impact pulmonary function. In over 50,000 study participants of European descent, we applied a recently developed joint meta-analysis method to simultaneously test associations of gene and gene-by-smoking interactions in relation to two major clinical measures of pulmonary function. Using this joint method to incorporate genetic main effects plus gene-by-smoking interaction, we identified three novel gene regions not previously related to pulmonary function: (1) DNER, (2) HLA-DQB1 and HLA-DQA2, and (3) KCNJ2 and SOX9. Expression analyses in human lung tissue from our own or prior studies indicate that these regions contain genes that are plausibly involved in pulmonary function. This work highlights the utility of employing novel methods for incorporating environmental interaction in genome-wide association studies to identify novel genetic regions.
doi:10.1371/journal.pgen.1003098
PMCID: PMC3527213  PMID: 23284291
15.  Genome-wide Association Analyses Suggested a Novel Mechanism for Smoking Behavior Regulated by IL15 
Molecular psychiatry  2009;14(7):668-680.
Cigarette smoking is the leading preventable cause of death in the US. Although smoking behavior has a significant genetic determination, the specific genes and associated mechanisms underlying smoking behavior are largely unknown. Here, we performed a genome-wide association study on smoking behavior in 840 Caucasians, including 417 males and 423 females, in which we examined ∼380,000 SNPs. We found that a cluster of nine SNPs upstream from the IL15 gene were associated with smoking status in males, with the most significant SNP, rs4956302, achieving a p value (8.80×10^−8) of genome-wide significance. Another SNP, rs17354547, that is highly conserved across multiple species, achieved a p value of 5.65×10^−5. These two SNPs, together with two additional SNPs (rs1402812 and rs4956396) were selected from the above nine SNPs for replication in an African-American sample containing 1,251 subjects, including 412 males and 839 females. The SNP rs17354547 was successfully replicated in the male subgroup of the replication sample; it was associated with smoking quantity (SQ), the Heaviness of Smoking Index (HSI) and the Fagerstrom Test for Nicotine Dependence (FTND), with p values of 0.031, 0.0046 and 0.019, respectively. In addition, a haplotype formed by rs17354547, rs1402812 and rs4956396 was also associated with SQ, HSI and FTND, achieving p values of 0.039, 0.0093 and 0.0093, respectively. To further confirm our findings, we performed an in silico replication study of the nine SNPs in a Framingham Heart Study sample containing 7,623 Caucasians from 1,731 families, among which, 3,491 subjects are males and 4,132 are females. Again, male-specific association with smoking status was observed, for which seven of the nine SNPs achieved significant p values (p<0.05) and two achieved marginally significant p values (p<0.10) in males.
Several of the nine SNPs, including the highly conserved one across species, rs17354547, are located at potential transcription factor binding sites, suggesting transcription regulation as a possible function for these SNPs. Through this function, the SNPs may modulate gene expression of IL15, a key cytokine regulating immune function. As the immune system has long been recognized to influence drug addiction behavior, our association findings suggest a novel mechanism for smoking addiction involving immune modulation via the IL15 pathway.
doi:10.1038/mp.2009.3
PMCID: PMC2700850  PMID: 19188921
smoking; nicotine addiction; IL15; genomewide association; genetics
16.  Framingham Heart Study 100K project: genome-wide associations for cardiovascular disease outcomes 
BMC Medical Genetics  2007;8(Suppl 1):S5.
Background
Cardiovascular disease (CVD) and its most common manifestations – including coronary heart disease (CHD), stroke, heart failure (HF), and atrial fibrillation (AF) – are major causes of morbidity and mortality. In many industrialized countries, cardiovascular disease (CVD) claims more lives each year than any other disease. Heart disease and stroke are the first and third leading causes of death in the United States. Prior investigations have reported several single gene variants associated with CHD, stroke, HF, and AF. We report a community-based genome-wide association study of major CVD outcomes.
Methods
In 1345 Framingham Heart Study participants from the largest 310 pedigrees (54% women, mean age 33 years at entry), we analyzed associations of 70,987 qualifying SNPs (Affymetrix 100K GeneChip) to four major CVD outcomes: major atherosclerotic CVD (n = 142; myocardial infarction, stroke, CHD death), major CHD (n = 118; myocardial infarction, CHD death), AF (n = 151), and HF (n = 73). Participants free of the condition at entry were included in proportional hazards models. We analyzed model-based deviance residuals using generalized estimating equations to test associations between SNP genotypes and traits in additive genetic models restricted to autosomal SNPs with minor allele frequency ≥0.10, genotype call rate ≥0.80, and Hardy-Weinberg equilibrium p-value ≥ 0.001.
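The SNP qualification criteria listed above (minor allele frequency ≥ 0.10, call rate ≥ 0.80, Hardy-Weinberg p-value ≥ 0.001) can be expressed as a per-SNP filter. This is a minimal sketch: the abstract does not say which Hardy-Weinberg test was used, so a 1-df chi-square goodness-of-fit test is assumed here, and the function name and genotype encoding are hypothetical.

```python
import math

def snp_qc(genotypes, maf_min=0.10, call_rate_min=0.80, hwe_p_min=0.001):
    """Apply MAF, call-rate and Hardy-Weinberg filters to one SNP.

    genotypes: list with 0, 1 or 2 copies of one allele per person, None if missing.
    Returns a dict with the three metrics and an overall 'passes' flag.
    """
    if not genotypes:
        raise ValueError("genotypes must be non-empty")
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passes": False}
    counts = [called.count(k) for k in (0, 1, 2)]
    p = (counts[1] + 2 * counts[2]) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1 - p)
    if maf == 0:
        hwe_p = 1.0  # monomorphic: the test is undefined, MAF filter will reject
    else:
        expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
        # Survival function of a chi-square variable with 1 degree of freedom.
        hwe_p = math.erfc(math.sqrt(chi2 / 2))
    passes = (maf >= maf_min and call_rate >= call_rate_min
              and hwe_p >= hwe_p_min)
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passes": passes}

# Hypothetical SNP in exact Hardy-Weinberg proportions with no missing calls.
print(snp_qc([0] * 25 + [1] * 50 + [2] * 25))
```

For small samples or rare alleles an exact Hardy-Weinberg test is usually preferred over the chi-square approximation used in this sketch.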
Results
Six associations yielded p < 10^-5. The lowest p-values for each CVD trait were as follows: major CVD, rs499818, p = 6.6 × 10^-6; major CHD, rs2549513, p = 9.7 × 10^-6; AF, rs958546, p = 4.8 × 10^-6; HF, rs740363, p = 8.8 × 10^-6. Of note, we found associations of a 13 Kb region on chromosome 9p21 with major CVD (p = 1.7–1.9 × 10^-5) and major CHD (p = 2.5–3.5 × 10^-4) that confirm associations with CHD in two recently reported genome-wide association studies. Also, rs10501920 in CNTN5 was associated with AF (p = 9.4 × 10^-6) and HF (p = 1.2 × 10^-4). Complete results for these phenotypes can be found at the dbGaP website.
Conclusion
No association attained genome-wide significance, but several intriguing findings emerged. Notably, we replicated associations of chromosome 9p21 with major CVD. Additional studies are needed to validate these results. Finding genetic variants associated with CVD may point to novel disease pathways and identify potential targeted preventive therapies.
doi:10.1186/1471-2350-8-S1-S5
PMCID: PMC1995607  PMID: 17903304
17.  Ischemic stroke risk, smoking, and the genetics of inflammation in a biracial population: the stroke prevention in young women study 
Thrombosis Journal  2008;6:11.
Background
Although cigarette smoking is a well-established risk factor for vascular disease, the genetic mechanisms that link cigarette smoking to an increased incidence of stroke are not well understood. Genetic variations within the genes of the inflammatory pathways are thought to partially mediate this risk. Here we evaluate the association of several inflammatory gene single nucleotide polymorphisms (SNPs) with ischemic stroke risk among young women, further stratified by current cigarette smoking status.
Methods
A population-based case-control study of stroke among women aged 15–49 identified 224 cases of first ischemic stroke (47.3% African-American) and 211 age-comparable control subjects (43.1% African-American). Several inflammatory candidate gene SNPs chosen through literature review were genotyped in the study population and assessed for association with stroke and interaction with smoking status.
Results
Of the 8 SNPs (across 6 genes) analyzed, only IL6 SNP rs2069832 (allele C, African-American frequency = 92%, Caucasian frequency = 55%) was found to be significantly associated with stroke using an additive model, and this was only among African-Americans (age-adjusted: OR = 2.2, 95% CI = 1.0–5.0, p = 0.049; risk factor adjusted: OR = 2.5, 95% CI = 1.0–6.5, p = 0.05). When stratified by smoking status, two SNPs demonstrated statistically significant gene-environment interactions. First, the T allele (frequency = 5%) of IL6 SNP rs2069830 was found to be protective among non-smokers (OR = 0.30, 95% CI = 0.11–0.82, p = 0.02), but not among smokers (OR = 1.63, 95% CI = 0.48–5.58, p = 0.43); genotype by smoking interaction (p = 0.036). Second, the C allele (frequency = 39%) of CD14 SNP rs2569190 was found to increase risk among smokers (OR = 2.05, 95% CI = 1.09–3.86, p = 0.03), but not among non-smokers (OR = 0.93, 95% CI = 0.62–1.39, p = 0.72); genotype by smoking interaction (p = 0.039).
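Odds ratios with 95% confidence intervals, reported separately for smokers and non-smokers, can be illustrated with an unadjusted 2×2 calculation per stratum. The sketch below uses a Wald interval on the log-odds scale and hypothetical counts; the study itself used an additive model with covariate adjustment, which this simplified example does not reproduce.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: carrier cases      b: non-carrier cases
    c: carrier controls   d: non-carrier controls
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts per smoking stratum (not data from the study).
strata = {
    "smokers": (30, 40, 15, 45),
    "non-smokers": (20, 60, 25, 70),
}
for name, table in strata.items():
    or_, lo, hi = odds_ratio_ci(*table)
    print(f"{name}: OR = {or_:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```

Because the interval is symmetric on the log scale, the reported odds ratio should sit at the geometric mean of its confidence limits, which gives a quick consistency check on published values.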
Conclusion
This study demonstrates that inflammatory gene SNPs are associated with early-onset ischemic stroke among African-American women (IL6) and that cigarette smoking may modulate stroke risk through a gene-environment interaction (IL6 and CD14). Our finding replicates a prior study showing an interaction with smoking and the C allele of CD14 SNP rs2569190.
doi:10.1186/1477-9560-6-11
PMCID: PMC2533289  PMID: 18727828
18.  Phenotype prediction from genome-wide association studies: application to smoking behaviors 
BMC Systems Biology  2012;6(Suppl 2):S11.
Background
The success of genome-wide association studies has drawn more attention to the personal genome and to clinical applications such as diagnosis and disease risk prediction. However, previous prediction studies using known disease-associated loci have not been successful (area under the curve 0.55–0.68 for type 2 diabetes and coronary heart disease). There are several reasons for this poor predictability, such as the small number of known disease-associated loci, simple analyses that do not consider phenotype complexity, and the limited number of features used for prediction.
Methods
In this research, we thoroughly investigated the effect of feature selection and prediction algorithms on prediction performance. In particular, we considered the following feature selection and prediction methods: regression analysis, regularized regression analysis, linear discriminant analysis, non-linear support vector machine, and random forest. For these methods, we studied the effects of feature selection and the number of features on prediction. Our investigation was based on the analysis of 8,842 Korean individuals genotyped by Affymetrix SNP array 5.0, for predicting smoking behaviors.
Results
To observe the effect of feature selection methods on prediction performance, selected features were used for prediction and area under the curve score was measured. For feature selection, the performances of support vector machine (SVM) and elastic-net (EN) showed better results than those of linear discriminant analysis (LDA), random forest (RF) and simple logistic regression (LR) methods. For prediction, SVM showed the best performance based on area under the curve score. With less than 100 SNPs, EN was the best prediction method while SVM was the best if over 400 SNPs were used for the prediction.
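The area under the curve score used to compare the methods above equals the probability that a randomly chosen positive individual receives a higher predicted score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name and the example scores are hypothetical, and a production analysis would normally use a library implementation.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: predicted scores, higher meaning more likely positive
    labels: 1 for positive (e.g. smoker), 0 for negative
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for four individuals.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why values in the 0.55 to 0.68 range are regarded as weak prediction.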
Conclusions
Based on combination of feature selection and prediction methods, SVM showed the best performance in feature selection and prediction.
doi:10.1186/1752-0509-6-S2-S11
PMCID: PMC3521177  PMID: 23281841
19.  Genome-wide association study for subclinical atherosclerosis in major arterial territories in the NHLBI's Framingham Heart Study 
BMC Medical Genetics  2007;8(Suppl 1):S4.
Introduction
Subclinical atherosclerosis (SCA) measures in multiple arterial beds are heritable phenotypes that are associated with increased incidence of cardiovascular disease. We conducted a genome-wide association study (GWAS) for SCA measurements in the community-based Framingham Heart Study.
Methods
Over 100,000 single nucleotide polymorphisms (SNPs) were genotyped (Human 100K GeneChip, Affymetrix) in 1345 subjects from 310 families. We calculated sex-specific age-adjusted and multivariable-adjusted residuals in subjects tested for quantitative SCA phenotypes, including ankle-brachial index, coronary artery calcification and abdominal aortic calcification using multi-detector computed tomography, and carotid intimal medial thickness (IMT) using carotid ultrasonography. We evaluated associations of these phenotypes with 70,987 autosomal SNPs with minor allele frequency ≥ 0.10, call rate ≥ 80%, and Hardy-Weinberg p-value ≥ 0.001 in samples ranging from 673 to 984 subjects, using linear regression with generalized estimating equations (GEE) methodology and family-based association testing (FBAT). Variance components LOD scores were also calculated.
Results
There was no association result meeting criteria for genome-wide significance, but our methods identified 11 SNPs with p < 10^-5 by GEE and five SNPs with p < 10^-5 by FBAT for multivariable-adjusted phenotypes. Among the associated variants were SNPs in or near genes that may be considered candidates for further study, such as rs1376877 (GEE p < 0.000001, located in ABI2) for maximum internal carotid artery IMT and rs4814615 (FBAT p = 0.000003, located in PCSK2) for maximum common carotid artery IMT. Modest significant associations were noted with various SCA phenotypes for variants in previously reported atherosclerosis candidate genes, including NOS3 and ESR1. Associations were also noted of a region on chromosome 9p21 with CAC phenotypes that confirm associations with coronary heart disease and CAC in two recently reported genome-wide association studies. In linkage analyses, several regions of genome-wide linkage were noted, confirming previously reported linkage of internal carotid artery IMT on chromosome 12. All GEE, FBAT and linkage results are provided as an open-access results resource at .
Conclusion
The results from this GWAS generate hypotheses regarding several SNPs that may be associated with SCA phenotypes in multiple arterial beds. Given the number of tests conducted, subsequent independent replication in a staged approach is essential to identify genetic variants that may be implicated in atherosclerosis.
doi:10.1186/1471-2350-8-S1-S4
PMCID: PMC1995605  PMID: 17903303
20.  An omnibus permutation test on ensembles of two-locus analyses can detect pure epistasis and genetic heterogeneity in genome-wide association studies 
SpringerPlus  2013;2:230.
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in data, number of causative SNPs and ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly-identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that locate within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1). 
Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
doi:10.1186/2193-1801-2-230
PMCID: PMC4006521  PMID: 24804170
Attribute selection; Complex disease; Epistasis; Genetic heterogeneity; Genome-wide association study; Pattern recognition; Permutation test; Single nucleotide polymorphism; Type 1 diabetes mellitus
21.  Genome-Wide Association Study of Gene by Smoking Interactions in Coronary Artery Calcification 
PLoS ONE  2013;8(10):e74642.
Many GWAS have identified novel loci associated with common diseases, but have focused only on main effects of individual genetic variants rather than interactions with environmental factors (GxE). Identification of GxE interactions is particularly important for coronary heart disease (CHD), a major preventable source of morbidity and mortality with strong non-genetic risk factors. Atherosclerosis is the major cause of CHD, and coronary artery calcification (CAC) is directly correlated with quantity of coronary atherosclerotic plaque. In the current study, we tested for genetic variants influencing extent of CAC via interaction with smoking (GxS), by conducting a GxS discovery GWAS in Genetic Epidemiology Network of Arteriopathy (GENOA) sibships (N = 915 European Americans) followed by replication in Framingham Heart Study (FHS) sibships (N = 1025 European Americans). Generalized estimating equations accounted for the correlation within sibships in strata-specific groups of smokers and nonsmokers, as well as GxS interaction. Primary analysis found SNPs that showed suggestive associations (p ≤ 10^−5) in GENOA GWAS, but these index SNPs did not replicate in FHS. However, secondary analysis was able to replicate candidate gene regions in FHS using other SNPs (±250 kb of GENOA index SNP). In smoker and nonsmoker groups, replicated genes included TCF7L2 (p = 6.0×10^−5) and WWOX (p = 4.5×10^−6); and TNFRSF8 (p = 7.8×10^−5), respectively. For GxS interactions, replicated genes included TBC1D4 (p = 6.9×10^−5) and ADAMTS9 (p = 7.1×10^−5). Interestingly, these genes are involved in inflammatory pathways mediated by the NF-κB axis. Since smoking is known to induce chronic and systemic inflammation, association of these genes likely reflects roles in CAC development via inflammatory pathways. Furthermore, the NF-κB axis regulates bone remodeling, a key physiological process in CAC development.
In conclusion, GxS GWAS has yielded evidence for novel loci that are associated with CAC via interaction with smoking, providing promising new targets for future population-based and functional studies of CAC development.
doi:10.1371/journal.pone.0074642
PMCID: PMC3789744  PMID: 24098343
22.  Genetic and functional association of FAM5C with myocardial infarction 
BMC Medical Genetics  2008;9:33.
Background
We previously identified a 40 Mb region of linkage on chromosome 1q in our early onset coronary artery disease (CAD) genome-wide linkage scan (GENECARD) with modest evidence for linkage (n = 420, LOD 0.95). When the data were stratified by acute coronary syndrome (ACS), this modest maximum in the overall group became a well-defined LOD peak (maximum LOD of 2.17, D1S1589/D1S518). This peak overlaps a recently identified inflammatory biomarker (MCP-1) linkage region from the Framingham Heart Study (maximum LOD of 4.27, D1S1589) and a region of linkage to metabolic syndrome from the IRAS study (maximum LOD of 2.59, D1S1589/D1S518). The overlap of genetic screens in independent data sets provides evidence for the existence of a gene or genes for CAD in this region.
Methods
A peak-wide association screen (457 SNPs) was conducted of a region 1 LOD score down from the peak marker (168–198 Mb) in a linkage peak for acute coronary syndrome (ACS) on chromosome 1, within a family-based early onset coronary artery disease (CAD) sample (GENECARD).
Results
Polymorphisms were identified within the 'family with sequence similarity 5, member C' gene (FAM5C) that show genetic linkage to and are associated with myocardial infarction (MI) in GENECARD. The association was confirmed in an independent CAD case-control sample (CATHGEN) and strong association with MI was identified with single nucleotide polymorphisms (SNPs) in the 3' end of FAM5C. FAM5C genotypes were also correlated with expression of the gene in human aorta. Expression levels of FAM5C decreased with increasing passage of proliferating aortic smooth muscle cells (SMC) suggesting a role for this molecule in smooth muscle cell proliferation and senescence.
Conclusion
These data implicate FAM5C alleles in the risk of myocardial infarction and suggest further functional studies of FAM5C are required to identify the gene's contribution to atherosclerosis.
doi:10.1186/1471-2350-9-33
PMCID: PMC2383879  PMID: 18430236
23.  Gene–smoking interactions in multiple Rho-GTPase pathway genes in an early-onset coronary artery disease cohort 
Human genetics  2013;132(12):10.1007/s00439-013-1339-7.
We performed a gene–smoking interaction analysis using families from an early-onset coronary artery disease cohort (GENECARD). This analysis was focused on validating and expanding results from previous studies implicating single nucleotide polymorphisms (SNPs) on chromosome 3 in smoking-mediated coronary artery disease. We analyzed 430 SNPs on chromosome 3 and identified 16 SNPs that showed a gene–smoking interaction at P < 0.05 using association in the presence of linkage—ordered subset analysis, a method that uses permutations of the data to empirically estimate the strength of the association signal. Seven of the 16 SNPs were in the Rho-GTPase pathway indicating a 1.87-fold enrichment for this pathway. A meta-analysis of gene–smoking interactions in three independent studies revealed that rs9289231 in KALRN had a Fisher’s combined P value of 0.0017 for the interaction with smoking. In a gene-based meta-analysis KALRN had a P value of 0.026. Finally, a pathway-based analysis of the association results using WebGestalt revealed several enriched pathways including the regulation of the actin cytoskeleton pathway as defined by the Kyoto Encyclopedia of Genes and Genomes.
doi:10.1007/s00439-013-1339-7
PMCID: PMC3835376  PMID: 23907653
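Note on entry 23: the abstract reports a Fisher's combined P value across three independent studies. The following is a minimal sketch of Fisher's method, assuming independent p-values; the function name and the input values are hypothetical and are not taken from the study. It uses the closed-form chi-square tail for an even number of degrees of freedom, so no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the upper tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Accumulate sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical per-study p-values, for illustration only.
    print(fisher_combined_p([0.04, 0.10, 0.03]))
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the formula.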
24.  Determining Effects of Non-synonymous SNPs on Protein-Protein Interactions using Supervised and Semi-supervised Learning 
PLoS Computational Biology  2014;10(5):e1003592.
Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such an nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed, resulting in promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of the SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks.
The accurate and balanced performance of the SNP-IN tool makes it well suited to studying the rewiring of large-scale protein-protein interaction networks and useful for functional annotation of disease-associated SNPs. The SNP-IN tool is freely accessible as a web server at http://korkinlab.org/snpintool/.
Author Summary
Many genetic diseases in humans and animals are caused by combinations of single-letter mutations, or SNPs. When these mutations occur in a protein-coding region of a genome, they can have a profound effect on the protein's function and ultimately on a health-related phenotype. Recently, a growing body of evidence suggests that many SNPs reside on or near the protein regions that are required for interactions with other proteins. Some of these SNPs could rewire the protein-protein interactions, altering the functions of the protein interaction complexes, while other SNPs are neutral to the interactions. Understanding the effect of SNPs on protein-protein interactions is a challenging problem to solve, both experimentally and computationally. Here, we leverage machine learning methods by training a computational predictor to tell apart the mutations that are harmful to protein-protein interactions from those that are not. We use these tools in two case studies of mutations affecting the protein-protein interaction networks centered around the genes associated with breast cancer and diabetes.
doi:10.1371/journal.pcbi.1003592
PMCID: PMC4006705  PMID: 24784581
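Note on entry 24: the abstract reports classifier performance as a weighted f-measure but does not define it. The sketch below assumes the common definition, the per-class F1 score averaged with weights proportional to each class's share of the true labels; the function name and the toy labels are hypothetical and do not come from the study.

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumed definition: each class's F1 is weighted by the fraction of
    true labels belonging to that class.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)          # true count per class
    predicted = Counter(y_pred)        # predicted count per class
    total = len(y_true)

    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        precision = true_pos / predicted[label] if predicted[label] else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


if __name__ == "__main__":
    # Toy 3-class labels (detrimental / neutral / beneficial), illustration only.
    truth = ["detrimental", "detrimental", "neutral", "beneficial"]
    guess = ["detrimental", "neutral", "neutral", "beneficial"]
    print(weighted_f_measure(truth, guess))
```

Weighting by class support keeps a large class from being drowned out by small ones, which matters when the mutation classes are unevenly represented.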
25.  Capturing the Spectrum of Interaction Effects in Genetic Association Studies by Simulated Evaporative Cooling Network Analysis 
PLoS Genetics  2009;5(3):e1000432.
Evidence from human genetic studies of several disorders suggests that interactions between alleles at multiple genes play an important role in influencing phenotypic expression. Analytical methods for identifying Mendelian disease genes are not appropriate when applied to common multigenic diseases, because such methods investigate association with the phenotype only one genetic locus at a time. New strategies are needed that can capture the spectrum of genetic effects, from Mendelian to multifactorial epistasis. Random Forests (RF) and Relief-F are two powerful machine-learning methods that have been studied as filters for genetic case-control data due to their ability to account for the context of alleles at multiple genes when scoring the relevance of individual genetic variants to the phenotype. However, when variants interact strongly, the independence assumption of RF in the tree node-splitting criterion leads to diminished importance scores for relevant variants. Relief-F, on the other hand, was designed to detect strong interactions but is sensitive to large backgrounds of variants that are irrelevant to classification of the phenotype, which is an acute problem in genome-wide association studies. To overcome the weaknesses of these data mining approaches, we develop Evaporative Cooling (EC) feature selection, a flexible machine learning method that can integrate multiple importance scores while removing irrelevant genetic variants. To characterize detailed interactions, we construct a genetic-association interaction network (GAIN), whose edges quantify the synergy between variants with respect to the phenotype. We use simulation analysis to show that EC is able to identify a wide range of interaction effects in genetic association data. We apply the EC filter to a smallpox vaccine cohort study of single nucleotide polymorphisms (SNPs) and infer a GAIN for a collection of SNPs associated with adverse events. 
Our results suggest an important role for hubs in SNP disease susceptibility networks. The software is available at http://sites.google.com/site/McKinneyLab/software.
Author Summary
Susceptibility to many diseases and disorders is caused by breakdown at multiple points in the genetic network. Each of these points of breakdown by itself may have a very modest effect on disease risk, but the points may have a much stronger effect through statistical interactions with each other. Genome-wide association studies provide the opportunity to identify alleles at multiple loci that interact to influence phenotypic variation in common diseases and disorders. However, if each SNP is tested for association as though it were independent of the rest of the genome, then the full advantage of the variation from markers across the genome will go unrealized. In this study, we illustrate the utility of a new approach to high-dimensional genetic association analysis that treats the collection of SNPs as interacting on a system level. This approach uses a machine-learning filter followed by an information theoretic and graph theoretic approach to infer a phenotype-specific network of interacting SNPs.
doi:10.1371/journal.pgen.1000432
PMCID: PMC2653647  PMID: 19300503