Genome-wide association studies often fall short of their full potential because of the sheer volume of data they generate. We sought to use the random forest algorithm to identify single-nucleotide polymorphisms (SNPs) that may be involved in gene-by-smoking interactions related to the early onset of coronary heart disease.
Using data from the Framingham Heart Study, our analysis used a case-only design in which the outcome of interest was age of onset of early coronary heart disease.
Smoking status was dichotomized as ever versus never. The single SNP with the highest importance score assigned by random forests was rs2011345. This SNP was not associated with age alone in the control subjects. Using generalized estimating equations to adjust for sex and account for familial correlation, there was evidence of an interaction between rs2011345 and smoking status.
The results of this analysis suggest that random forests may be a useful tool for identifying SNPs taking part in gene-by-environment interactions in genome-wide association studies.
Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population.
Random forests (RF) provides an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities of probands regarding their genotypes. These proximities are then used to impute genotypes of untyped SNPs.
We evaluated this approach using genotype data of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop 16 and the Caucasian HapMap samples as reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.
Recently we have shown that the human life span is influenced jointly by many common single-nucleotide polymorphisms (SNPs), each with a small individual effect. Here we investigate further the polygenic influence on life span and discuss its possible biological mechanisms. First we identified six sets of prolongevity SNP alleles in the Framingham Heart Study 550K SNPs data, using six different statistical procedures (normal linear, Cox, and logistic regressions; generalized estimating equations; mixed model; gene frequency method). We then estimated joint effects of these SNPs on human survival. We found that alleles in each set show a significant additive influence on life span. Twenty-seven SNPs comprised the overlapping set of SNPs that influenced life span, regardless of the statistical procedure. The majority of these SNPs (74%) were within genes, compared to 40% of SNPs in the original 550K set. We then performed a review of current literature on functions of genes closest to these 27 SNPs. The review showed that the respective genes are largely involved in aging, cancer, and brain disorders. We concluded that polygenic effects can explain a substantial portion of genetic influence on life span. Composition of the set of prolongevity alleles depends on the statistical procedure used for the allele selection. At the same time, there is a core set of longevity alleles that are selected with all statistical procedures. Functional relevance of respective genes to aging and major diseases supports causal relationships between the identified SNPs and life span. The fact that genes found in our and other genetic association studies of aging/longevity have similar functions indicates high chances of true positive associations for corresponding genetic variants.
In population-based studies, it is generally recognized that single nucleotide polymorphism (SNP) markers are not independent. Rather, they are carried by haplotypes, groups of SNPs that tend to be coinherited. It is thus possible to choose a much smaller number of SNPs to use as indices for identifying haplotypes or haplotype blocks in genetic association studies. We refer to these characteristic SNPs as index SNPs. In order to reduce costs and work, a minimum number of index SNPs that can distinguish all SNP and haplotype patterns should be chosen. Unfortunately, this is an NP-complete problem, requiring brute force algorithms that are not feasible for large data sets.
We have developed a double classification tree search algorithm to generate index SNPs that can distinguish all SNP and haplotype patterns. This algorithm runs very rapidly and generates very good, though not necessarily minimum, sets of index SNPs, as is to be expected for such NP-complete problems.
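The idea of choosing a small set of index SNPs that still separates every haplotype pattern can be illustrated with a simple greedy heuristic. This is a minimal sketch only and is not the double classification tree search described above; the function name `greedy_index_snps` and the string encoding of haplotypes are assumptions made for illustration. Like the published algorithm, it returns a good set that is not guaranteed to be minimum.

```python
from itertools import combinations


def greedy_index_snps(haplotypes):
    """Greedily pick SNP positions until all distinct haplotypes are separated.

    haplotypes: list of equal-length strings, one character per SNP allele.
    Returns a sorted list of SNP positions (hypothetical helper, not the
    authors' algorithm).
    """
    patterns = sorted(set(haplotypes))
    n_snps = len(patterns[0])
    # Every pair of distinct haplotype patterns must differ at some chosen SNP.
    unresolved = set(combinations(range(len(patterns)), 2))
    chosen = []

    def separated(j):
        # Pairs still unresolved that SNP j would tell apart.
        return {(a, b) for (a, b) in unresolved
                if patterns[a][j] != patterns[b][j]}

    while unresolved:
        candidates = [j for j in range(n_snps) if j not in chosen]
        # Prefer the SNP separating the most pairs; break ties by position.
        best = max(candidates, key=lambda j: (len(separated(j)), -j))
        unresolved -= separated(best)
        chosen.append(best)
    return sorted(chosen)


index_snps = greedy_index_snps(["0000", "0110", "1010", "1100", "0110"])
```

In this toy input the fourth SNP is constant and the third is redundant given the first two, so two index SNPs are enough to tell the four distinct haplotypes apart.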
A new algorithm for index SNP selection has been developed. A webserver for index SNP selection is available at
Age-dependent genetic effects on susceptibility to hypertension have been documented. We present a novel variance-component method for the estimation of age-dependent genetic effects on longitudinal systolic blood pressure using 57,827 Affymetrix single-nucleotide polymorphisms (SNPs) on chromosomes 17-22 genotyped in 2,475 members of the Offspring Cohort of the Framingham Heart Study. We used the likelihood-ratio test statistic to test the main genetic effect, genotype-by-age interaction, and simultaneously, main genetic effect and genotype-by-age interactions (2 degrees of freedom (df) test) for each SNP. Applying Bonferroni correction, three SNPs were significantly associated with longitudinal blood pressure in the analysis of main genetic effects or in combined 2-df analyses. For the associations detected using the simultaneous 2-df test, neither main effects nor genotype-by-age interaction p-values reached genome-wide statistical significance. The value of the 2-df test for screening genetic interaction effects could not be established in this study.
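The 2-df likelihood-ratio test and the Bonferroni correction mentioned above can be sketched as follows. This is a hedged illustration, assuming the test statistic follows a standard chi-square distribution with 2 degrees of freedom (variance-component tests can follow mixture distributions instead) and assuming one test per SNP; the abstract does not state the exact number of tests corrected for, and the function names are hypothetical.

```python
import math


def lrt_pvalue_2df(loglik_null, loglik_full):
    """P-value of a likelihood-ratio test with 2 degrees of freedom.

    For a chi-square variable with 2 df the survival function has the
    closed form exp(-x / 2), so no statistics library is needed.
    """
    stat = 2.0 * (loglik_full - loglik_null)
    return math.exp(-max(stat, 0.0) / 2.0)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumption: one test for each of the 57,827 SNPs at a family-wise alpha of 0.05.
threshold = bonferroni_threshold(0.05, 57827)
is_significant = lrt_pvalue_2df(-1200.0, -1184.0) < threshold
```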
Traditional genome-wide association studies are generally limited in their ability to explain a large portion of genetic risk for most common diseases. We sought to use both traditional GWAS methods and more recently developed polygenic genome-wide analysis techniques to identify subsets of single-nucleotide polymorphisms (SNPs) that may be involved in risk of cardiovascular disease, as well as to estimate the heritability explained by common SNPs.
Using data from the Framingham SNP Health Association Resource (SHARe), three complementary methods were applied to examine the genetic factors associated with the Framingham Risk Score, a widely accepted indicator of underlying cardiovascular disease risk. The first method adopted a traditional GWAS approach: independently testing each SNP for association with the Framingham Risk Score. The other two approaches involved polygenic methods with the intention of providing estimates of aggregate genetic risk and heritability.
While no SNPs were independently associated with the Framingham Risk Score based on the results of the traditional GWAS analysis, we were able to identify cardiovascular disease-related SNPs as reported by previous studies. A predictive polygenic analysis was only able to explain approximately 1% of the genetic variance when predicting the 10-year risk of general cardiovascular disease. However, 20% to 30% of the variation in the Framingham Risk Score was explained using a recently developed method that considers the joint effect of all SNPs simultaneously.
The results of this study imply that common SNPs explain a large amount of the variation in the Framingham Risk Score and suggest that future, better-powered genome-wide association studies, possibly informed by knowledge of gene-pathways, will uncover more risk variants that will help to elucidate the genetic architecture of cardiovascular disease.
Due to the high-dimensionality of single-nucleotide polymorphism (SNP) data, region-based methods are an attractive approach to the identification of genetic variation associated with a certain phenotype. A common approach to defining regions is to identify the most significant SNPs from a single-SNP association analysis, and then use a gene database to obtain a list of genes proximal to the identified SNPs. Alternatively, regions may be defined statistically, via a scan statistic. After categorizing SNPs as significant or not (based on the single-SNP association p-values), a scan statistic is useful to identify regions that contain more significant SNPs than expected by chance. Important features of this method are that regions are defined statistically, so that there is no dependence on a gene database, and both gene and inter-gene regions can be detected. In the analysis of blood-lipid phenotypes from the Framingham Heart Study (FHS), we compared statistically defined regions with those formed from the top single SNP tests. Although we missed a number of single SNPs, we also identified many additional regions not found as SNP-database regions and avoided issues related to region definition. In addition, analyses of candidate genes for high-density lipoprotein, low-density lipoprotein, and triglyceride levels suggested that associations detected with region-based statistics are also found using the scan statistic approach.
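A window-based scan over dichotomized single-SNP results can be sketched as below. This is a simplified illustration, assuming a fixed window of consecutive SNPs and a binomial upper-tail probability per window; it is not the scan statistic used in the study, and the per-window probability is not adjusted for scanning over many overlapping windows. The names `binom_upper_tail` and `scan_regions` are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def scan_regions(pvalues, cutoff=0.05, window=10):
    """Return (start, count, tail_prob) for the most enriched window.

    pvalues: single-SNP association p-values ordered by genomic position.
    A SNP is flagged as significant when its p-value is <= cutoff.
    """
    flags = [p <= cutoff for p in pvalues]
    rate = sum(flags) / len(flags)  # overall proportion of flagged SNPs
    results = []
    for start in range(len(flags) - window + 1):
        count = sum(flags[start:start + window])
        results.append((start, count, binom_upper_tail(count, window, rate)))
    # The window with the smallest tail probability holds the largest excess.
    return min(results, key=lambda r: r[2])


example = [0.5] * 20 + [0.001] * 5 + [0.5] * 20
best_start, best_count, best_tail = scan_regions(example, window=5)
```

Because windows are defined purely by SNP position, a region flagged this way can fall inside a gene or between genes, which matches the motivation given above for avoiding a gene database.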
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Feature selection; GENICA; Importance measure; logicFS; Logic regression
Determining the most promising single-nucleotide polymorphisms (SNPs) presents a challenge in genome-wide association studies, when hundreds of thousands of association tests are conducted. The power to detect genetic effects depends on minor allele frequency (MAF), and the SNP arrays used in genome-wide association studies include SNPs with a wide distribution of MAFs. Therefore, it is critical to understand the effect of MAF on the false-positive rate.
Simulated Framingham Heart Study data (Problem 3, with answers) were used to examine the effects of varying MAFs on the likelihood of false positives. Replication set 1 was used to generate 1 million permutations of case/control status in unrelated individuals. Logistic regression was used to test for the association between each SNP and myocardial infarction using an additive model. We report the number of "significant" tests by MAF at α = 10⁻⁴, 10⁻⁵, and 10⁻⁶.
Common SNPs exhibited fewer false positives than expected. At α = 10⁻⁴, SNPs with MAF 25% and 50% resulted in 69.2 [95% CI: 62.8-75.6] and 70.8 [95% CI: 61.3-80.4] false positives, respectively, compared to 100 expected. Rare SNPs exhibited more variability but did not show more false-positive results than expected by chance. However, at α = 10⁻⁴, MAF = 5% exhibited significantly more false positives (105.5 [95% CI: 81-130.1]) than MAF = 25% and 50%. Similar results were seen at the other alpha values.
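The expected count of 100 follows from multiplying the number of permutations by the significance level. The short sketch below reproduces that arithmetic and checks whether a reported confidence interval excludes the expected count; the helper names are hypothetical, and the intervals are the ones quoted above.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests falling below the significance level."""
    return n_tests * alpha


def ci_excludes(expected, lower, upper):
    """True when the expected count lies outside the confidence interval."""
    return expected < lower or expected > upper


expected = expected_false_positives(1_000_000, 1e-4)  # about 100
common_differs = ci_excludes(expected, 62.8, 75.6)    # MAF = 25%
rare_differs = ci_excludes(expected, 81.0, 130.1)     # MAF = 5%
```

With these inputs the MAF = 25% interval excludes the expected count, while the MAF = 5% interval contains it.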
These results suggest that removal of low MAF SNPs from analysis due to concerns about inflated false-positive results may not be appropriate.
Systemic biomarkers provide insights into disease pathogenesis, diagnosis, and risk stratification. Many systemic biomarker concentrations are heritable phenotypes. Genome-wide association studies (GWAS) provide mechanisms to investigate the genetic contributions to biomarker variability unconstrained by current knowledge of physiological relations.
We examined the association of Affymetrix 100K GeneChip single nucleotide polymorphisms (SNPs) to 22 systemic biomarker concentrations in 4 biological domains: inflammation/oxidative stress; natriuretic peptides; liver function; and vitamins. Related members of the Framingham Offspring cohort (n = 1012; mean age 59 ± 10 years, 51% women) had both phenotype and genotype data (minimum-maximum per phenotype n = 507–1008). We used Generalized Estimating Equations (GEE), Family Based Association Tests (FBAT) and variance components linkage to relate SNPs to multivariable-adjusted biomarker residuals. Autosomal SNPs (n = 70,987) meeting the following criteria were studied: minor allele frequency ≥ 10%, call rate ≥ 80% and Hardy-Weinberg equilibrium p ≥ 0.001.
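The SNP inclusion criteria above can be sketched as a simple filter. This is a minimal illustration, assuming per-SNP genotype counts are available and using a 1-df chi-square goodness-of-fit test for Hardy-Weinberg equilibrium; the abstract does not say which HWE test was used, and the function names are hypothetical.

```python
import math


def hwe_pvalue(n_AA, n_Aa, n_aa):
    """Chi-square (1 df) goodness-of-fit p-value for Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_AA, n_Aa, n_aa, n_missing,
              min_maf=0.10, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply the MAF, call-rate, and HWE thresholds quoted in the text."""
    n_called = n_AA + n_Aa + n_aa
    if n_called == 0:
        return False
    call_rate = n_called / (n_called + n_missing)
    freq_A = (2 * n_AA + n_Aa) / (2 * n_called)
    maf = min(freq_A, 1.0 - freq_A)
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_pvalue(n_AA, n_Aa, n_aa) >= min_hwe_p)
```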
With GEE, 58 SNPs had p < 10⁻⁶: the top SNPs were rs2494250 (p = 1.00 × 10⁻¹⁴) and rs4128725 (p = 3.68 × 10⁻¹²) for monocyte chemoattractant protein-1 (MCP1), and rs2794520 (p = 2.83 × 10⁻⁸) and rs2808629 (p = 3.19 × 10⁻⁸) for C-reactive protein (CRP) averaged from 3 examinations (over about 20 years). With FBAT, 11 SNPs had p < 10⁻⁶: the top SNPs were the same for MCP1 (rs4128725, p = 3.28 × 10⁻⁸, and rs2494250, p = 3.55 × 10⁻⁸), and also included B-type natriuretic peptide (rs437021, p = 1.01 × 10⁻⁶) and Vitamin K percent undercarboxylated osteocalcin (rs2052028, p = 1.07 × 10⁻⁶). The peak LOD (logarithm of the odds) scores were for MCP1 (4.38, chromosome 1) and CRP (3.28, chromosome 1; previously described) concentrations; of note the 1.5 support interval included the MCP1 and CRP SNPs reported above (GEE model). Previous candidate SNP associations with circulating CRP concentrations were replicated at p < 0.05; the SNPs rs2794520 and rs2808629 are in linkage disequilibrium with previously reported SNPs. GEE, FBAT and linkage results are posted at .
The Framingham GWAS represents a resource to describe potentially novel genetic influences on systemic biomarker variability. The newly described associations will need to be replicated in other studies.
The Framingham Heart Study (FHS) recently obtained initial results from the first genome-wide association scan for renal traits. The study of 70,987 single nucleotide polymorphisms (SNPs) in 1,010 FHS participants provides a list of SNPs showing the strongest associations with renal traits which need to be verified in independent study samples.
Sixteen SNPs were selected for replication based on the most promising associations with chronic kidney disease (CKD), estimated glomerular filtration rate (eGFR), and serum cystatin C in FHS. These SNPs were genotyped in 15,747 participants of the Atherosclerosis Risk in Communities (ARIC) Study and evaluated for association using multivariable adjusted regression analyses. Primary outcomes in ARIC were CKD and eGFR. Secondary prospective analyses were conducted for association with kidney disease progression using multivariable adjusted Cox proportional hazards regression. The definition of the outcomes, all covariates, and the use of an additive genetic model were consistent with the original analyses in FHS.
The intronic SNP rs6495446 in the gene MTHFS was significantly associated with CKD among white ARIC participants at visit 4: the odds ratio per each C allele was 1.24 (95% CI 1.09–1.41, p = 0.001). Borderline significant associations of rs6495446 were observed with CKD at study visit 1 (p = 0.024), eGFR at study visits 1 (p = 0.073) and 4 (lower mean eGFR per C allele by 0.6 ml/min/1.73 m², p = 0.043) and kidney disease progression (hazard ratio 1.13 per each C allele, 95% CI 1.00–1.26, p = 0.041). Another SNP, rs3779748 in EYA1, was significantly associated with CKD at ARIC visit 1 (odds ratio per each T allele 1.22, p = 0.01), but only with eGFR and cystatin C in FHS.
This genome-wide association study provides unbiased information implicating MTHFS as a candidate gene for kidney disease. Our findings highlight the importance of replication to identify common SNPs associated with renal traits.
To account for population stratification in association studies, principal-components analysis is often performed on single-nucleotide polymorphisms (SNPs) across the genome. Here, we use Framingham Heart Study (FHS) Genetic Analysis Workshop 16 data to compare the performance of local ancestry adjustment for population stratification based on principal components (PCs) estimated from SNPs in a local chromosomal region with global ancestry adjustment based on PCs estimated from genome-wide SNPs.
Standardized height residuals from unrelated adults from the FHS Offspring Cohort were averaged from longitudinal data. PCs of SNP genotype data were calculated to represent each individual's ancestry either 1) globally using all SNPs across the genome or 2) locally using SNPs in adjacent 20-Mbp regions within each chromosome. We assessed the extent to which there were differences in association studies of height depending on whether PCs for global, local, or both global and local ancestry were included as covariates.
The correlations between local and global PCs were low (r < 0.12), suggesting variability between local and global ancestry estimates. Genome-wide association tests without any ancestry adjustment demonstrated an inflated type I error rate that decreased with adjustment for local ancestry, global ancestry, or both. A known spurious association was replicated for SNPs within the lactase gene, and this false-positive association was abolished by adjustment with local or global ancestry PCs.
Population stratification is a potential source of bias in this seemingly homogeneous FHS population. However, local and global PCs derived from SNPs appear to provide adequate information about ancestry.
Linkage disequilibrium (LD) is an important measure used in the analysis of single-nucleotide polymorphism (SNP) data. We used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to explore the effect of sampling methods on the estimation of LD for SNP data.
Method and data
We found 332 trios in the GAW16 Framingham SNP data. Repeated random samples without replacement, of different numbers of trios and independent individuals, were drawn from these 332 trios. For each sample, LD was calculated using the Haploview program for the chromosome 1 SNP data. Percentages of D' > 0.8 and r² > 0.8 were calculated for different distance bins based on the Haploview output. The results were summarized by sample size and sampling method to give an overall view of the effect of sample size and sampling method on LD estimation.
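For reference, the two LD measures summarized above can be computed for a pair of biallelic SNPs from haplotype counts. This is a minimal sketch, assuming phased haplotype counts are already known; Haploview estimates haplotype frequencies from genotype data, which this example does not attempt. The function name `ld_stats` is hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r^2) for two biallelic SNPs from phased haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # raw disequilibrium coefficient D
    # D' scales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Only three of the four haplotypes are observed: D' is 1 while r^2 is about 0.33.
d_prime, r2 = ld_stats(50, 0, 25, 25)
```

The example shows why the two measures are reported separately: D' reaches 1 whenever one of the four haplotypes is absent, whereas r² also requires matching allele frequencies.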
The trio design gave stable estimates. A sample of 30 to 40 trios gave estimates of the percentage of LD > 0.8 very close to those from all 332 trios. When independent individuals were used, the estimates were less stable and differed from those obtained from the 332 trios for both D' and r², with larger differences for D'.
Our results suggest that trio design gives a stable estimate of LD. Therefore it may be more suitable for LD analysis than using independent individuals. We must be cautious when comparing the LD estimates from trios, and those from independent individuals.
One of the most challenging tasks in studying common complex human diseases is to search for both strong and weak susceptibility single-nucleotide polymorphisms (SNPs) and to identify forms of genetic disease models. A number of methods have been proposed for this purpose. Many of them have not been validated through application to various genome datasets, so their performance in real practice is not clear. In this paper, we present a novel SNP association study method based on probability theory, called ProbSNP. The method first detects SNPs by evaluating their joint probabilities in combination with disease status and selects those with the lowest joint probabilities as susceptibility SNPs, and then identifies some forms of genetic disease models by testing multiple-locus interactions among the selected SNPs. The joint probabilities of combined SNPs are estimated by establishing Gaussian distribution probability density functions, in which the related parameters (i.e., mean value and standard deviation) are evaluated based on allele and haplotype frequencies. Finally, we test and validate the method using various genome datasets. We find that ProbSNP has shown remarkable success in the applications to both simulated genome data and real genome-wide data.
Association study; SNPs; probability theory; Gaussian distribution; case-control
Genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show the potential of applying multigenic approaches in association studies of various diseases, including lung cancer. But the question remains as to the best methodology to analyze single nucleotide polymorphisms in multiple genes. In this work, we consider four methods (logistic regression, logic regression, classification tree, and random forests) and compare their results for identifying important genes or gene-gene and gene-environment interactions. To evaluate the performance of the four methods, the cross-validation misclassification error and areas under the curves are provided. We performed a simulation study and applied the methods to the data from a large-scale, population-based, case-control study.
SNP interactions; Logistic regression; Classification tree; Logic regression; Random Forests; Cross-validation error; Area under the Curve
The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that are sufficient to infer all the other SNPs. Algorithms based on the r² LD statistic have gained popularity because r² is directly related to statistical power to detect disease associations. Most existing r²-based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
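The pairwise r² approach that multi-marker methods build on can be illustrated with a greedy selection over a precomputed r² matrix. This is a minimal sketch of pairwise tagging only, not FastTagger and not a multi-marker method; the function name `greedy_tag_snps` and the 0.8 default threshold are assumptions for illustration.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy pairwise tag SNP selection.

    r2: symmetric matrix (list of lists) of pairwise r^2 values.
    A SNP counts as tagged when it is itself a tag or has r^2 >= threshold
    with a chosen tag. Returns tag indices in the order they were picked.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []

    def covers(i):
        # Untagged SNPs that choosing SNP i as a tag would capture.
        return {j for j in untagged if j == i or r2[i][j] >= threshold}

    while untagged:
        # Pick the SNP capturing the most untagged SNPs; ties go to the lowest index.
        best = max(range(n), key=lambda i: (len(covers(i)), -i))
        untagged -= covers(best)
        tags.append(best)
    return tags


r2_matrix = [
    [1.00, 0.90, 0.50, 0.10],
    [0.90, 1.00, 0.85, 0.10],
    [0.50, 0.85, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
tags = greedy_tag_snps(r2_matrix)
```

Here SNP 1 tags SNPs 0 and 2, and SNP 3 must tag itself, so two tags cover all four SNPs.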
We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experiment results show that FastTagger is several times faster than existing multi-marker based tag SNP selection algorithms, and it consumes much less memory at the same time. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
FastTagger also produces smaller sets of tag SNPs than existing multi-marker based algorithms, and the reduction ratio ranges from 3% to 9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when r² ≥ 0.9.
Generating multi-marker tagging rules is a computation intensive task, and it is the bottleneck of existing multi-marker based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.
Studies have shown that interactions among single nucleotide polymorphisms (SNPs) may play an important role in understanding the causes of complex disease. Machine learning approaches provide useful features to explore interactions more effectively and efficiently. We have proposed an integrated method that combines two machine learning methods - Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS) - to identify a subset of important SNPs and detect interaction patterns. In this two-stage RF-MARS (TRM) approach, RF is first applied to detect a predictive subset of SNPs, and then MARS is used to identify the interaction patterns among the selected SNPs. We evaluated the TRM performances in four models: three causal models with one two-way interaction and one null model. RF variable selection was based on out-of-bag classification error rate (OOB) and variable importance spectrum (IS). First, we compared the selection of important variables by RF and MARS. Our results support that RFOOB had better performance than MARS and RFIS in detecting important variables. We also evaluated the true positive rate and false positive rate of identifying interaction patterns in TRM and MARS. This study demonstrates that TRMOOB, which is RFOOB plus MARS, has combined the strengths of RF and MARS in identifying SNP-SNP interaction patterns in a scenario of 100 candidate SNPs. TRMOOB had a greater true positive rate and a lower false positive rate than MARS, particularly when searching for interactions with a strong association with the outcome. Therefore the use of TRMOOB is favored for exploring SNP-SNP interactions in a large-scale genetic variation study.
polymorphism; interaction; machine learning
Genome-wide association studies offer an unbiased approach to identify new candidate genes for osteoporosis. We examined the Affymetrix 500K + 50K SNP GeneChip marker sets for associations with multiple osteoporosis-related traits at various skeletal sites, including bone mineral density (BMD, hip and spine), heel ultrasound, and hip geometric indices in the Framingham Osteoporosis Study. We evaluated 433,510 single-nucleotide polymorphisms (SNPs) in 2073 women (mean age 65 years), members of two-generational families. Variance components analysis was performed to estimate phenotypic, genetic, and environmental correlations (ρP, ρG, and ρE) among bone traits. Linear mixed-effects models were used to test associations between SNPs and multivariable-adjusted trait values. We evaluated the proportion of SNPs associated with pairs of the traits at a nominal significance threshold α = 0.01. We found substantial correlation between the proportion of associated SNPs and the ρP and ρG (r = 0.91 and 0.84, respectively) but much lower with ρE (r = 0.38). Thus, for example, hip and spine BMD had 6.8% associated SNPs in common, corresponding to ρP = 0.55 and ρG = 0.66 between them. Fewer SNPs were associated with both BMD and any of the hip geometric traits (eg, femoral neck and shaft width, section moduli, neck shaft angle, and neck length); ρG between BMD and geometric traits ranged from −0.24 to +0.40. In conclusion, we examined relationships between osteoporosis-related traits based on genome-wide associations. Most of the similarity between the quantitative bone phenotypes may be attributed to pleiotropic effects of genes. This knowledge may prove helpful in defining the best phenotypes to be used in genetic studies of osteoporosis. © 2010 American Society for Bone and Mineral Research.
bone mineral density; quantitative ultrasound; femoral geometry; genome-wide association; single-nucleotide polymorphisms; genetic correlations; pleiotropy
A variety of diseases are caused by chromosomal abnormalities such as aneuploidies (having an abnormal number of chromosomes), microdeletions, microduplications, and uniparental disomy. High density single nucleotide polymorphism (SNP) microarrays provide information on chromosomal copy number changes, as well as genotype (heterozygosity and homozygosity). SNP array studies generate multiple types of data for each SNP site, some with more than 100,000 SNPs represented on each array. The identification of different classes of anomalies within SNP data has been challenging.
We have developed SNPscan, a web-accessible tool to analyze and visualize high density SNP data. It enables researchers (1) to visually and quantitatively assess the quality of user-generated SNP data relative to a benchmark data set derived from a control population, (2) to display SNP intensity and allelic call data in order to detect chromosomal copy number anomalies (duplications and deletions), (3) to display uniparental isodisomy based on loss of heterozygosity (LOH) across genomic regions, (4) to compare paired samples (e.g. tumor and normal), and (5) to generate a file type for viewing SNP data in the University of California, Santa Cruz (UCSC) Human Genome Browser. SNPscan accepts data exported from Affymetrix Copy Number Analysis Tool as its input. We validated SNPscan using data generated from patients with known deletions, duplications, and uniparental disomy. We also inspected previously generated SNP data from 90 apparently normal individuals from the Centre d'Étude du Polymorphisme Humain (CEPH) collection, and identified three cases of uniparental isodisomy, four females having an apparently mosaic X chromosome, two mislabelled SNP data sets, and one microdeletion on chromosome 2 with mosaicism from an apparently normal female. These previously unrecognized abnormalities were all detected using SNPscan. The microdeletion was independently confirmed by fluorescence in situ hybridization, and a region of homozygosity in a UPD case was confirmed by sequencing of genomic DNA.
SNPscan is useful to identify chromosomal abnormalities based on SNP intensity (such as chromosomal copy number changes) and heterozygosity data (including regions of LOH and some cases of UPD). The program and source code are available at the SNPscan website .
The widespread use of high-throughput methods of single nucleotide polymorphism (SNP) genotyping has created a number of computational and statistical challenges. The problem of identifying SNP–SNP interactions in case–control studies has been studied extensively, and a number of new techniques have been developed. Little progress has been made, however, in the analysis of SNP–SNP interactions in relation to time-to-event data, such as patient survival time or time to cancer relapse. We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP–SNP interactions in the context of survival analysis. The proposed Survival MDR (Surv-MDR) method handles survival data by modifying MDR’s constructive induction algorithm to use the log-rank test. Surv-MDR replaces balanced accuracy with log-rank test statistics as the score to determine the best models. We simulated datasets with a survival outcome related to two loci in the absence of any marginal effects. We compared Surv-MDR with Cox regression for their ability to identify the true predictive loci in these simulated data. We also used this simulation to construct the empirical distribution of Surv-MDR’s testing score. We then applied Surv-MDR to genetic data from a population-based epidemiologic study to find prognostic markers of survival time following a bladder cancer diagnosis. We identified several two-locus SNP combinations that have strong associations with patients’ survival outcome. Surv-MDR is capable of detecting interaction models with weak main effects. These epistatic models tend to be dropped by traditional Cox regression approaches to evaluating interactions.
With improved efficiency to handle genome wide datasets, Surv-MDR will play an important role in a research strategy that embraces the complexity of the genotype–phenotype mapping relationship since epistatic interactions are an important component of the genetic basis of disease.
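The scoring idea described above can be illustrated with a minimal sketch. It assumes that each two-locus genotype cell is labelled high risk when its signed log-rank statistic against all other subjects is positive, and that the pooled high-risk versus low-risk log-rank statistic is then used as the model score. The function names and this exact labelling rule are illustrative assumptions, not the published implementation.

```python
import math

def logrank_z(times, events, in_group):
    """Signed log-rank Z for in_group vs. everyone else.

    Positive values mean more events than expected inside the group
    (poorer survival). events: 1 = event observed, 0 = censored.
    """
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for u in times if u >= t)                      # at risk, total
        n1 = sum(1 for u, g in zip(times, in_group) if u >= t and g)  # at risk, group
        d = sum(1 for u, e in zip(times, events) if e and u == t)     # events, total
        d1 = sum(1 for u, e, g in zip(times, events, in_group)
                 if e and u == t and g)                          # events, group
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected / math.sqrt(variance)

def surv_mdr_score(times, events, snp_a, snp_b):
    """Pool two-locus genotype cells into high/low risk and score the split.

    Assumption: a cell is high risk when its own log-rank Z is positive.
    Returns (absolute pooled log-rank Z, per-subject high-risk flags).
    """
    cells = list(zip(snp_a, snp_b))
    high_cells = set()
    for cell in set(cells):
        member = [c == cell for c in cells]
        if logrank_z(times, events, member) > 0:
            high_cells.add(cell)
    high_risk = [c in high_cells for c in cells]
    return abs(logrank_z(times, events, high_risk)), high_risk
```

In a full search, this score would be computed for every SNP pair and the pair with the largest value retained; cross-validation and permutation testing are omitted here.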
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration show no significant difference relative to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
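The filtration step (a p-value threshold combined with LD pruning) can be sketched as follows. This is a minimal illustration, assuming that LD is approximated by the squared Pearson correlation of genotype dosages and that pruning is greedy in order of increasing p-value. The function names and thresholds are hypothetical and are not taken from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors (0/1/2 dosages)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as no LD
    return sxy * sxy / (sxx * syy)

def filter_snps(genotypes, p_values, p_threshold, r2_threshold):
    """Keep SNPs passing the p-value threshold, then prune by LD.

    genotypes: dict mapping SNP id -> list of dosages, one per subject
    p_values:  dict mapping SNP id -> single-marker association p-value
    SNPs are visited from most to least significant; a SNP is kept only
    if its r^2 with every already-kept SNP is below r2_threshold.
    """
    candidates = sorted(
        (snp for snp in genotypes if p_values[snp] <= p_threshold),
        key=lambda snp: p_values[snp],
    )
    kept = []
    for snp in candidates:
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_threshold for k in kept):
            kept.append(snp)
    return kept
```

Repeating this for several thresholds and comparing the classification error of a random forest trained on each filtered set would correspond to the threshold comparison described above; the forest itself is not shown.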
With the advent of cost-effective genotyping technologies, genome-wide association studies allow researchers to examine hundreds of thousands of single nucleotide polymorphisms (SNPs) for association with human disease. Recently, many researchers applying this strategy have detected strong associations to disease with SNP markers that are either not in linkage disequilibrium with any nonsynonymous SNP or located at large distances from any annotated gene. In such cases, no well-established standard practice for effective SNP selection for follow-up studies exists. We aim to identify and prioritize groups of SNPs that are more likely to affect phenotypes in order to facilitate efficient SNP selection for follow-up studies.
Based on the annotations available in the Ensembl database, we categorized SNPs in the human genome into classes related to regulatory attributes, such as epigenetic modifications and transcription factor binding sites, in addition to classes related to gene structure and cross-species conservation. Using the distribution of derived allele frequencies (DAF) within each class, we assessed the strength of natural selection for each class relative to the genome as a whole. We applied this DAF analysis to Perlegen resequenced SNPs genome-wide. Regulatory elements annotated by Ensembl such as specific histone methylation sites as well as classes defined by cross-species conservation showed negative selection in comparison to the genome as a whole.
These results highlight which annotated classes are under purifying selection, have putative functional importance, and contain SNPs that are strong candidates for follow-up studies after genome-wide association. Such SNP annotation may also be useful in interpreting results of whole-genome sequencing studies.
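One simple way to compare a class's DAF distribution with the genome as a whole is to ask whether the class carries an excess of low-frequency derived alleles, the pattern expected under purifying selection. The sketch below assumes a single low-DAF cutoff of 0.1 and treats the pooled SNPs as the genome-wide background. Both choices, and the function names, are illustrative assumptions rather than the analysis used in the study.

```python
def low_daf_fraction(dafs, cutoff=0.1):
    """Fraction of SNPs whose derived allele frequency is below the cutoff."""
    return sum(1 for f in dafs if f < cutoff) / len(dafs)

def low_daf_enrichment(daf_by_class, cutoff=0.1):
    """Ratio of each class's low-DAF fraction to the pooled background.

    daf_by_class: dict mapping annotation class -> list of DAF values.
    A ratio above 1 indicates an excess of rare derived alleles in that
    class relative to all SNPs supplied, consistent with negative selection.
    """
    pooled = [f for dafs in daf_by_class.values() for f in dafs]
    background = low_daf_fraction(pooled, cutoff)
    if background == 0:
        raise ValueError("no SNPs below the cutoff in the pooled data")
    return {
        name: low_daf_fraction(dafs, cutoff) / background
        for name, dafs in daf_by_class.items()
    }
```

A real analysis would compare the full frequency spectra and attach a significance test; this ratio only summarises the direction of the shift.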
Polymorphisms in the MYH9 and adjacent APOL1 gene region demonstrate a strong association with non-diabetic kidney disease in African-Americans. However, it is not known to what extent these polymorphisms are present in other ethnic groups. To examine the association of genetic polymorphisms in this region with chronic kidney disease (CKD; estimated glomerular filtration rate <60 ml/min/1.73 m²) in individuals of European ancestry, we examined rs4821480, an MYH9 single-nucleotide polymorphism (SNP) recently identified as associated with kidney disease in African-Americans, in 13 133 participants from the Framingham Heart Study (FHS) and Atherosclerosis Risk in Communities (ARIC) Study. In addition, we further interrogated the MYH9/APOL1 gene region using 282 SNPs for association with CKD using age-, sex- and center-adjusted models and performed a meta-analysis of the results from both studies. Because of prior data linking rs4821480 and kidney disease, we used a P-value threshold of <0.05 to test its association with CKD. In the meta-analysis, rs4821480 (minor allele frequency 4.45 and 3.96% in FHS and ARIC, respectively) was associated with higher CKD prevalence in participants free of diabetes (odds ratio 1.44; 95% confidence interval 1.15–1.80; P = 0.001). No other SNPs achieved significance after adjusting for multiple testing. Results utilizing directly genotyped data confirmed the results of the primary analysis. Recently identified APOL1 risk variants were also directly genotyped, but did not account for the observed MYH9 signal. These data suggest that the MYH9 polymorphism rs4821480 is associated with an increased risk of non-diabetic CKD in individuals of European ancestry.
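As a consistency check on the reported meta-analysis result, the standard error and P-value can be approximately recovered from the odds ratio and its confidence interval. The sketch assumes a Wald-type interval that is symmetric on the log-odds scale; the abstract does not state how the interval was computed, so this is an assumption, and the function name is illustrative.

```python
import math

def p_from_odds_ratio_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover SE, Z and two-sided P from an odds ratio and its 95% CI.

    Assumes the interval is exp(log(OR) +/- z_crit * SE), i.e. a Wald
    interval on the log-odds scale.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

# Values reported above for rs4821480 in participants free of diabetes.
se, z, p = p_from_odds_ratio_ci(1.44, 1.15, 1.80)
print(f"SE(log OR) = {se:.3f}, Z = {z:.2f}, P = {p:.4f}")
```

Under this assumption the recovered P-value is about 0.0014, in line with the reported P = 0.001.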
We performed a pairwise epistatic interaction test using the chicken 60 K single nucleotide polymorphism (SNP) chip for the 11th generation of the Northeast Agricultural University broiler lines divergently selected for abdominal fat content. A linear mixed model was used to test two dimensions of SNP interactions affecting abdominal fat weight. With a threshold of P < 1.2×10⁻¹¹ from a Bonferroni correction at the 5% level, 52 pairs of SNPs were detected, comprising 45 pairs showing an Additive×Additive and seven pairs showing an Additive×Dominance epistatic effect. The contribution rates of significant epistatic interactive SNPs ranged from 0.62% to 1.54%, with 47 pairs contributing more than 1%. The SNP-SNP network affecting abdominal fat weight constructed using the significant SNP pairs was analyzed, estimated and annotated. On the basis of the network’s features, SNPs Gga_rs14303341 and Gga_rs14988623 at the center of the subnet should be important nodes, and an interaction between GGAZ and GGA8 was suggested. Twenty-two quantitative trait loci, 97 genes (including nine non-coding genes), and 50 pathways were annotated on the epistatic interactive SNP-SNP network. The results of the present study provide insights into the genetic architecture underlying broiler chicken abdominal fat weight.
Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build an RF.
We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
Our results suggest that by strategically revising the Random Forest tree-building method or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers the advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.
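The revised tree-building idea, in which each tree sees only SNPs in linkage equilibrium with one another, can be sketched under a simplifying assumption: SNPs have already been grouped into LD blocks, and each tree draws one SNP at random from every block. The text above does not specify how the LE subsets are formed, so this grouping, the function name, and the SNP labels in the usage example are all hypothetical.

```python
import random

def le_snp_sets(ld_blocks, n_trees, seed=0):
    """Draw one candidate SNP set per tree, with one SNP from each LD block.

    ld_blocks: list of lists of SNP ids; SNPs within an inner list are
               assumed to be in LD with each other and in LE with the rest.
    Returns a list of n_trees SNP lists, each mutually in LE by construction.
    """
    rng = random.Random(seed)
    return [[rng.choice(block) for block in ld_blocks] for _ in range(n_trees)]

# Hypothetical SNP labels, used only to show the shape of the output.
blocks = [["snpA1", "snpA2", "snpA3"], ["snpB1"], ["snpC1", "snpC2"]]
for tree_index, snps in enumerate(le_snp_sets(blocks, n_trees=3)):
    print(tree_index, snps)
```

Each tree would then be grown on its own SNP set, so that across the forest every SNP in a block has a chance to be selected without competing against its correlated neighbours within a single tree.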