Genome-wide association studies often fall short of their full potential because of the sheer volume of data they generate. We sought to use the random forest algorithm to identify single-nucleotide polymorphisms (SNPs) that may be involved in gene-by-smoking interactions related to the early onset of coronary heart disease.
Using data from the Framingham Heart Study, our analysis used a case-only design in which the outcome of interest was age of onset of early coronary heart disease.
Smoking status was dichotomized as ever versus never. The single SNP with the highest importance score assigned by random forests was rs2011345. This SNP was not associated with age alone in the control subjects. Using generalized estimating equations to adjust for sex and account for familial correlation, there was evidence of an interaction between rs2011345 and smoking status.
The results of this analysis suggest that random forests may be a useful tool for identifying SNPs taking part in gene-by-environment interactions in genome-wide association studies.
Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population.
Random forests (RF) provide an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities of probands regarding their genotypes. These proximities are then used to impute genotypes of untyped SNPs.
We evaluated this approach using genotype data of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop 16 and the Caucasian HapMap samples as reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.
Recently we have shown that the human life span is influenced jointly by many common single-nucleotide polymorphisms (SNPs), each with a small individual effect. Here we investigate further the polygenic influence on life span and discuss its possible biological mechanisms. First we identified six sets of prolongevity SNP alleles in the Framingham Heart Study 550K SNPs data, using six different statistical procedures (normal linear, Cox, and logistic regressions; generalized estimating equations; mixed model; gene frequency method). We then estimated joint effects of these SNPs on human survival. We found that alleles in each set show significant additive influence on life span. Twenty-seven SNPs comprised the overlapping set of SNPs that influenced life span, regardless of the statistical procedure. The majority of these SNPs (74%) were within genes, compared to 40% of SNPs in the original 550K set. We then performed a review of current literature on functions of genes closest to these 27 SNPs. The review showed that the respective genes are largely involved in aging, cancer, and brain disorders. We concluded that polygenic effects can explain a substantial portion of genetic influence on life span. Composition of the set of prolongevity alleles depends on the statistical procedure used for the allele selection. At the same time, there is a core set of longevity alleles that are selected with all statistical procedures. Functional relevance of respective genes to aging and major diseases supports causal relationships between the identified SNPs and life span. The fact that genes found in our and other genetic association studies of aging/longevity have similar functions indicates high chances of true positive associations for corresponding genetic variants.
Multiple studies have identified single-nucleotide polymorphisms (SNPs) that are associated with coronary heart disease (CHD). We examined whether SNPs selected based on predefined criteria will improve CHD risk prediction when added to traditional risk factors (TRFs).
SNPs were selected from the literature based on association with CHD, lack of association with a known CHD risk factor, and successful replication. A genetic risk score (GRS) was constructed based on these SNPs. Cox proportional hazards model was used to calculate CHD risk based on the Atherosclerosis Risk in Communities (ARIC) and Framingham CHD risk scores with and without the GRS.
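As a rough illustration of how a genetic risk score of this kind can be assembled, the sketch below sums risk-allele counts across SNPs, with optional per-SNP weights. The SNP names, the weights, and the function name are hypothetical, and the text above does not say whether the score was weighted, so both variants are shown.

```python
from typing import Mapping, Optional


def genetic_risk_score(
    risk_allele_counts: Mapping[str, int],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Sum risk-allele counts (0, 1 or 2 per SNP), optionally weighted per SNP."""
    score = 0.0
    for snp, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: allele count must be 0, 1 or 2, got {count}")
        # Unweighted scores treat every risk allele as contributing equally.
        weight = 1.0 if weights is None else weights[snp]
        score += weight * count
    return score


# Hypothetical genotypes for one participant (placeholder SNP names).
person = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
unweighted = genetic_risk_score(person)
weighted = genetic_risk_score(person, {"snp_a": 0.5, "snp_b": 0.25, "snp_c": 1.0})
```

In practice the resulting score would enter the Cox model as one additional covariate alongside the traditional risk factors.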
The GRS was associated with risk for CHD (hazard ratio [HR] = 1.10; 95% confidence interval [CI]: 1.07–1.13). Addition of the GRS to the ARIC risk score significantly improved discrimination, reclassification, and calibration beyond that afforded by TRFs alone in non-Hispanic whites in the ARIC study. The area under the receiver operating characteristic curve (AUC) increased from 0.742 to 0.749 (Δ= 0.007; 95% CI, 0.004–0.013), and the net reclassification index (NRI) was 6.3%. Although the risk estimates for CHD in the Framingham Offspring (HR = 1.12; 95% CI: 1.10–1.14) and Rotterdam (HR = 1.08; 95% CI: 1.02–1.14) Studies were significantly improved by adding the GRS to TRFs, improvements in AUC and NRI were modest.
Addition of a GRS based on direct associations with CHD to TRFs significantly improved discrimination and reclassification in white participants of the ARIC Study, with no significant improvement in the Rotterdam and Framingham Offspring Studies.
Genetics; Risk factors; Coronary disease
Age-dependent genetic effects on susceptibility to hypertension have been documented. We present a novel variance-component method for the estimation of age-dependent genetic effects on longitudinal systolic blood pressure using 57,827 Affymetrix single-nucleotide polymorphisms (SNPs) on chromosomes 17-22 genotyped in 2,475 members of the Offspring Cohort of the Framingham Heart Study. We used the likelihood-ratio test statistic to test the main genetic effect, genotype-by-age interaction, and simultaneously, main genetic effect and genotype-by-age interactions (2 degrees of freedom (df) test) for each SNP. Applying Bonferroni correction, three SNPs were significantly associated with longitudinal blood pressure in the analysis of main genetic effects or in combined 2-df analyses. For the associations detected using the simultaneous 2-df test, neither main effects nor genotype-by-age interaction p-values reached genome-wide statistical significance. The value of the 2-df test for screening genetic interaction effects could not be established in this study.
In population-based studies, it is generally recognized that single nucleotide polymorphism (SNP) markers are not independent. Rather, they are carried by haplotypes, groups of SNPs that tend to be coinherited. It is thus possible to choose a much smaller number of SNPs to use as indices for identifying haplotypes or haplotype blocks in genetic association studies. We refer to these characteristic SNPs as index SNPs. In order to reduce costs and work, a minimum number of index SNPs that can distinguish all SNP and haplotype patterns should be chosen. Unfortunately, this is an NP-complete problem, requiring brute force algorithms that are not feasible for large data sets.
We have developed a double classification tree search algorithm to generate index SNPs that can distinguish all SNP and haplotype patterns. This algorithm runs very rapidly and generates very good, though not necessarily minimum, sets of index SNPs, as is to be expected for such NP-complete problems.
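To make the selection problem concrete, the sketch below uses a simple greedy heuristic: it repeatedly picks the SNP that separates the most haplotype pairs not yet distinguished. This is not the double classification tree search described above, only a stand-in showing what a good but not necessarily minimum index set looks like. Haplotypes are assumed to be equal-length strings of allele codes, and the function name is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def greedy_index_snps(haplotypes: Sequence[str]) -> List[int]:
    """Pick SNP positions until every pair of distinct haplotypes differs at a chosen one."""
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0]) if distinct else 0
    # Every pair of distinct haplotypes starts out unresolved.
    unresolved = set(combinations(range(len(distinct)), 2))
    chosen: List[int] = []
    while unresolved:
        # Choose the SNP that separates the most still-unresolved pairs.
        best = max(
            range(n_snps),
            key=lambda j: sum(distinct[a][j] != distinct[b][j] for a, b in unresolved),
        )
        # Keep only the pairs that the chosen SNP fails to separate.
        unresolved = {
            (a, b) for a, b in unresolved if distinct[a][best] == distinct[b][best]
        }
        chosen.append(best)
    return sorted(chosen)


# Four haplotypes over four SNPs; SNPs 0/1 and SNPs 2/3 are redundant pairs.
example_haplotypes = ["0000", "0011", "1100", "1111"]
index_snps = greedy_index_snps(example_haplotypes)
```

Because the haplotypes are deduplicated first, each unresolved pair differs at some SNP, so every iteration resolves at least one pair and the loop terminates.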
A new algorithm for index SNP selection has been developed. A webserver for index SNP selection is available at
Recently, genome wide association studies (GWAS) have identified a number of single nucleotide polymorphisms (SNPs) as being associated with coronary heart disease (CHD). We estimated the effect of these SNPs on incident CHD, stroke and total mortality in the prospective cohorts of the MORGAM Project. We studied cohorts from Finland, Sweden, France and Northern Ireland (total N = 33,282, including 1,436 incident CHD events and 571 incident stroke events). The lead SNPs at seven loci identified thus far and additional SNPs (in total 42) were genotyped using a case-cohort design. We estimated the effect of the SNPs on disease history at baseline, disease events during follow-up and classic risk factors. Multiple testing was taken into account using false discovery rate (FDR) analysis. SNP rs1333049 on chromosome 9p21.3 was associated with both CHD and stroke (HR = 1.20, 95% CI 1.08–1.34 for incident CHD events and 1.15, 0.99–1.34 for incident stroke). SNP rs11670734 (19q12) was associated with total mortality and stroke. SNP rs2146807 (10q11.21) showed some association with the fatality of acute coronary event. SNP rs2943634 (2q36.3) was associated with high density lipoprotein (HDL) cholesterol and SNPs rs599839, rs4970834 (1p13.3) and rs17228212 (15q22.23) were associated with non-HDL cholesterol. SNPs rs2943634 (2q36.3) and rs12525353 (6q25.1) were associated with blood pressure. These findings underline the need for replication studies in prospective settings and confirm the candidacy of several SNPs that may play a role in the etiology of cardiovascular disease.
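The abstract above does not say which false discovery rate procedure was used. As one common choice, the following sketch computes Benjamini-Hochberg adjusted p-values; the procedure and the function name are assumptions for illustration, not the study's documented method.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four SNP tests.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

A SNP would then be reported at a chosen FDR level, for example 5%, when its adjusted value falls at or below that level.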
cardiovascular disease; genes; risk factors
Traditional genome-wide association studies are generally limited in their ability to explain a large portion of genetic risk for most common diseases. We sought to use both traditional GWAS methods and more recently developed polygenic genome-wide analysis techniques to identify subsets of single-nucleotide polymorphisms (SNPs) that may be involved in risk of cardiovascular disease, as well as to estimate the heritability explained by common SNPs.
Using data from the Framingham SNP Health Association Resource (SHARe), three complementary methods were applied to examine the genetic factors associated with the Framingham Risk Score, a widely accepted indicator of underlying cardiovascular disease risk. The first method adopted a traditional GWAS approach, independently testing each SNP for association with the Framingham Risk Score. The other two approaches involved polygenic methods with the intention of providing estimates of aggregate genetic risk and heritability.
While no SNPs were independently associated with the Framingham Risk Score based on the results of the traditional GWAS analysis, we were able to identify cardiovascular disease-related SNPs as reported by previous studies. A predictive polygenic analysis was only able to explain approximately 1% of the genetic variance when predicting the 10-year risk of general cardiovascular disease. However, 20% to 30% of the variation in the Framingham Risk Score was explained using a recently developed method that considers the joint effect of all SNPs simultaneously.
The results of this study imply that common SNPs explain a large amount of the variation in the Framingham Risk Score and suggest that future, better-powered genome-wide association studies, possibly informed by knowledge of gene-pathways, will uncover more risk variants that will help to elucidate the genetic architecture of cardiovascular disease.
Systemic biomarkers provide insights into disease pathogenesis, diagnosis, and risk stratification. Many systemic biomarker concentrations are heritable phenotypes. Genome-wide association studies (GWAS) provide mechanisms to investigate the genetic contributions to biomarker variability unconstrained by current knowledge of physiological relations.
We examined the association of Affymetrix 100K GeneChip single nucleotide polymorphisms (SNPs) to 22 systemic biomarker concentrations in 4 biological domains: inflammation/oxidative stress; natriuretic peptides; liver function; and vitamins. Related members of the Framingham Offspring cohort (n = 1012; mean age 59 ± 10 years, 51% women) had both phenotype and genotype data (minimum-maximum per phenotype n = 507–1008). We used Generalized Estimating Equations (GEE), Family Based Association Tests (FBAT) and variance components linkage to relate SNPs to multivariable-adjusted biomarker residuals. Autosomal SNPs (n = 70,987) meeting the following criteria were studied: minor allele frequency ≥ 10%, call rate ≥ 80% and Hardy-Weinberg equilibrium p ≥ 0.001.
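The three SNP inclusion criteria can be expressed as a small filter. The sketch below is a minimal version under stated assumptions: genotypes are coded as allele counts 0/1/2 with None for a missing call, and Hardy-Weinberg equilibrium is checked with a 1-degree-of-freedom Pearson chi-square test, which may differ from the test the authors used. The function name is hypothetical.

```python
import math
from typing import Optional, Sequence


def snp_passes_qc(
    genotypes: Sequence[Optional[int]],
    min_maf: float = 0.10,
    min_call_rate: float = 0.80,
    min_hwe_p: float = 0.001,
) -> bool:
    """Apply MAF, call-rate and HWE filters to one SNP (0/1/2 counts, None = no call)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    n = len(called)
    call_rate = n / len(genotypes)
    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    freq_b = (n_ab + 2 * n_bb) / (2 * n)
    maf = min(freq_b, 1 - freq_b)
    if maf == 0:
        return False  # monomorphic SNP, nothing to test
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (
        n * (1 - freq_b) ** 2,
        2 * n * freq_b * (1 - freq_b),
        n * freq_b ** 2,
    )
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return maf >= min_maf and call_rate >= min_call_rate and hwe_p >= min_hwe_p


# Hypothetical SNPs: one in equilibrium, one with only heterozygotes.
in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25
all_heterozygous = [1] * 100
```

The default thresholds mirror the criteria stated above (MAF ≥ 10%, call rate ≥ 80%, HWE p ≥ 0.001).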
With GEE, 58 SNPs had p < 10⁻⁶: the top SNPs were rs2494250 (p = 1.00 × 10⁻¹⁴) and rs4128725 (p = 3.68 × 10⁻¹²) for monocyte chemoattractant protein-1 (MCP1), and rs2794520 (p = 2.83 × 10⁻⁸) and rs2808629 (p = 3.19 × 10⁻⁸) for C-reactive protein (CRP) averaged from 3 examinations (over about 20 years). With FBAT, 11 SNPs had p < 10⁻⁶: the top SNPs were the same for MCP1 (rs4128725, p = 3.28 × 10⁻⁸, and rs2494250, p = 3.55 × 10⁻⁸), and also included B-type natriuretic peptide (rs437021, p = 1.01 × 10⁻⁶) and Vitamin K percent undercarboxylated osteocalcin (rs2052028, p = 1.07 × 10⁻⁶). The peak LOD (logarithm of the odds) scores were for MCP1 (4.38, chromosome 1) and CRP (3.28, chromosome 1; previously described) concentrations; of note the 1.5 support interval included the MCP1 and CRP SNPs reported above (GEE model). Previous candidate SNP associations with circulating CRP concentrations were replicated at p < 0.05; the SNPs rs2794520 and rs2808629 are in linkage disequilibrium with previously reported SNPs. GEE, FBAT and linkage results are posted at .
The Framingham GWAS represents a resource to describe potentially novel genetic influences on systemic biomarker variability. The newly described associations will need to be replicated in other studies.
Determining the most promising single-nucleotide polymorphisms (SNPs) presents a challenge in genome-wide association studies, when hundreds of thousands of association tests are conducted. The power to detect genetic effects is dependent on minor allele frequency (MAF), and genome-wide association studies SNP arrays include SNPs with a wide distribution of MAFs. Therefore, it is critical to understand MAF's effect on the false positive rate.
The Framingham Heart Study simulated data (Problem 3, with answers) were used to examine the effects of varying MAFs on the likelihood of false positives. Replication set 1 was used to generate 1 million permutations of case/control status in unrelated individuals. Logistic regression was used to test for the association between each SNP and myocardial infarction using an additive model. We report the number of "significant" tests by MAF at α = 10⁻⁴, 10⁻⁵, and 10⁻⁶.
Common SNPs exhibited fewer false positives than expected. At α = 10⁻⁴, SNPs with MAF 25% and 50% resulted in 69.2 [95% CI: 62.8-75.6] and 70.8 [95% CI: 61.3-80.4] false positives, respectively, compared to 100 expected. Rare SNPs exhibited more variability but did not show more false-positive results than expected by chance. However, at α = 10⁻⁴, MAF = 5% exhibited significantly more false positives (105.5 [95% CI: 81-130.1]) than MAF = 25% and 50%. Similar results were seen at the other alpha values.
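The expected count of 100 follows from 1 million permuted tests at a per-test threshold of 10⁻⁴. The short sketch below reproduces that arithmetic and checks whether each reported 95% confidence interval excludes the expected count; the helper names are hypothetical, and the intervals are the ones quoted above.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of null tests falling below the threshold alpha."""
    return n_tests * alpha


def interval_excludes(value: float, lower: float, upper: float) -> bool:
    """True when the value lies outside the closed interval [lower, upper]."""
    return not (lower <= value <= upper)


expected = expected_false_positives(1_000_000, 1e-4)

# 95% confidence intervals reported above at alpha = 1e-4, keyed by MAF.
reported_intervals = {"25%": (62.8, 75.6), "50%": (61.3, 80.4), "5%": (81.0, 130.1)}
deviates_from_expected = {
    maf: interval_excludes(expected, lower, upper)
    for maf, (lower, upper) in reported_intervals.items()
}
```

Under this reading, the common-SNP intervals lie entirely below the expected count, while the interval for MAF = 5% contains it.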
These results suggest that removal of low MAF SNPs from analysis due to concerns about inflated false-positive results may not be appropriate.
Due to the high-dimensionality of single-nucleotide polymorphism (SNP) data, region-based methods are an attractive approach to the identification of genetic variation associated with a certain phenotype. A common approach to defining regions is to identify the most significant SNPs from a single-SNP association analysis, and then use a gene database to obtain a list of genes proximal to the identified SNPs. Alternatively, regions may be defined statistically, via a scan statistic. After categorizing SNPs as significant or not (based on the single-SNP association p-values), a scan statistic is useful to identify regions that contain more significant SNPs than expected by chance. Important features of this method are that regions are defined statistically, so that there is no dependence on a gene database, and both gene and inter-gene regions can be detected. In the analysis of blood-lipid phenotypes from the Framingham Heart Study (FHS), we compared statistically defined regions with those formed from the top single SNP tests. Although we missed a number of single SNPs, we also identified many additional regions not found as SNP-database regions and avoided issues related to region definition. In addition, analyses of candidate genes for high-density lipoprotein, low-density lipoprotein, and triglyceride levels suggested that associations detected with region-based statistics are also found using the scan statistic approach.
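The core idea of the scan step, finding stretches that hold unusually many significant SNPs, can be sketched with a sliding window over base-pair positions. The example below only locates the densest window of a fixed width and does not evaluate its statistical significance, which a real scan statistic would do. The window width, positions, and function name are illustrative assumptions.

```python
from typing import Sequence, Tuple


def densest_window(
    positions: Sequence[int], significant: Sequence[bool], width: int
) -> Tuple[int, int, int]:
    """Return (first_hit, last_hit, count) for the width-bp window with most significant SNPs."""
    hits = sorted(p for p, s in zip(positions, significant) if s)
    best = (0, 0, 0)
    right = 0
    for left, start in enumerate(hits):
        # Advance the right edge while hits still fall inside [start, start + width).
        while right < len(hits) and hits[right] < start + width:
            right += 1
        count = right - left
        if count > best[2]:
            best = (start, hits[right - 1], count)
    return best


# Hypothetical base-pair positions and significance calls for seven SNPs.
snp_positions = [100, 200, 250, 300, 5000, 9000, 9100]
snp_significant = [True, False, True, True, False, True, True]
region = densest_window(snp_positions, snp_significant, width=500)
```

Because windows are anchored at significant SNPs rather than at gene boundaries, intergenic regions are treated the same way as genic ones.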
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Feature selection; GENICA; Importance measure; logicFS; Logic regression
The Framingham Heart Study (FHS) recently obtained initial results from the first genome-wide association scan for renal traits. The study of 70,987 single nucleotide polymorphisms (SNPs) in 1,010 FHS participants provides a list of SNPs showing the strongest associations with renal traits which need to be verified in independent study samples.
Sixteen SNPs were selected for replication based on the most promising associations with chronic kidney disease (CKD), estimated glomerular filtration rate (eGFR), and serum cystatin C in FHS. These SNPs were genotyped in 15,747 participants of the Atherosclerosis Risk in Communities (ARIC) Study and evaluated for association using multivariable adjusted regression analyses. Primary outcomes in ARIC were CKD and eGFR. Secondary prospective analyses were conducted for association with kidney disease progression using multivariable adjusted Cox proportional hazards regression. The definition of the outcomes, all covariates, and the use of an additive genetic model were consistent with the original analyses in FHS.
The intronic SNP rs6495446 in the gene MTHFS was significantly associated with CKD among white ARIC participants at visit 4: the odds ratio per each C allele was 1.24 (95% CI 1.09–1.41, p = 0.001). Borderline significant associations of rs6495446 were observed with CKD at study visit 1 (p = 0.024), eGFR at study visits 1 (p = 0.073) and 4 (lower mean eGFR per C allele by 0.6 ml/min/1.73 m², p = 0.043) and kidney disease progression (hazard ratio 1.13 per each C allele, 95% CI 1.00–1.26, p = 0.041). Another SNP, rs3779748 in EYA1, was significantly associated with CKD at ARIC visit 1 (odds ratio per each T allele 1.22, p = 0.01), but only with eGFR and cystatin C in FHS.
This genome-wide association study provides unbiased information implicating MTHFS as a candidate gene for kidney disease. Our findings highlight the importance of replication to identify common SNPs associated with renal traits.
Inhibition of the endocannabinoid receptor CB1 improves insulin sensitivity, lowers glycemia and slows atherosclerosis. We analyzed if common variants in the gene encoding CB1, CNR1, are associated with insulin resistance, risk of type 2 diabetes (T2D) or coronary heart disease (CHD).
We studied 2,411 participants of the Framingham Offspring Study (mean age 60 years, 52% women) for quantitative traits and CHD, and the Framingham SHARe database for T2D risk. We genotyped 19 single nucleotide polymorphisms (SNPs) that tagged 85% (at r² = 0.8) of common (>5%) CNR1 SNPs. Fasting blood glucose and insulin at the 7th (1999–2001) exam were collected. We used age-, sex-, BMI-adjusted models to test additive associations of genotype with HOMA-IR (linear mixed-effect models), T2D or CHD. To account for multiple tests of SNPs, we generated empirical P values. The C allele at SNP rs806365 (frequency, 57.4%), ~4.1kb 3′ from CNR1, was associated with increased HOMA-IR (n=2,261, beta=0.05 per C, empirical P=0.01), risk of T2D (674 cases, OR=1.19 per C, nominal P=0.01) and CHD (237 cases, HR=1.23 per C, nominal P=0.04). The association of rs806365 with HOMA-IR was replicated in a meta-analysis of two independent cohorts (NHANES III plus Partners Case-Control Diabetes Study; 2,540 white individuals, beta=0.037, nominal P=0.007), but not in the large MAGIC Consortium (n=29,248, nominal P=0.74). The association of rs806365 was not replicated either with T2D in DIAGRAM (n=10,128, nominal P=0.31), or with CHD in PROCARDIS (n=13,614, nominal P=0.37).
Although supported by initial results, we found no reproducible statistical association of common variation at CNR1 with insulin resistance, T2D or CHD.
endocannabinoids; candidate genes; diabetes mellitus; insulin resistance; coronary heart disease
Linkage disequilibrium (LD) is an important measure used in the analysis of single-nucleotide polymorphism (SNP) data. We used the Genetic Analysis Workshop 16 (GAW16) Framingham Heart Study 500 k SNP data to explore the effect of sampling methods on the estimation of LD for SNP data.
Method and data
We found 332 trios in the GAW16 Framingham SNP data. Repeated random samples without replacement, of different numbers of trios and of independent individuals, were drawn from these 332 trios. For each sample, LD was calculated using the Haploview program for the chromosome 1 SNP data. Percentages of SNP pairs with D' > 0.8 and r² > 0.8 were calculated for different distance bins based on the Haploview output. The results are summarized by sample size and sampling method to give an overall view of the effect of sample size and sampling method on LD estimation.
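For reference, the two LD measures compared here can be computed from haplotype and allele frequencies as in the sketch below. It uses the standard textbook definitions of D' and r² rather than Haploview itself, and the function name and example frequencies are hypothetical.

```python
from typing import Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D', r^2) from the A-B haplotype frequency and the two allele frequencies."""
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    # D is scaled by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


# Hypothetical pair: allele B occurs only on haplotypes that carry allele A.
d_prime, r_squared = ld_stats(p_ab=0.2, p_a=0.5, p_b=0.2)
```

In this example D' is 1 while r² is only 0.25, one reason the two measures can respond differently to sample size and sampling design.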
The trio design gave stable estimates. A sample of 30 to 40 trios gave estimates of percent of LD > 0.8 very close to those from 332 trios. When independent individuals are used, the estimates are less stable and are different from those obtained from the 332 trios for both D' and r², with larger differences for D'.
Our results suggest that trio design gives a stable estimate of LD. Therefore it may be more suitable for LD analysis than using independent individuals. We must be cautious when comparing the LD estimates from trios, and those from independent individuals.
To account for population stratification in association studies, principal-components analysis is often performed on single-nucleotide polymorphisms (SNPs) across the genome. Here, we use Framingham Heart Study (FHS) Genetic Analysis Workshop 16 data to compare the performance of local ancestry adjustment for population stratification based on principal components (PCs) estimated from SNPs in a local chromosomal region with global ancestry adjustment based on PCs estimated from genome-wide SNPs.
Standardized height residuals from unrelated adults from the FHS Offspring Cohort were averaged from longitudinal data. PCs of SNP genotype data were calculated to represent individual's ancestry either 1) globally using all SNPs across the genome or 2) locally using SNPs in adjacent 20-Mbp regions within each chromosome. We assessed the extent to which there were differences in association studies of height depending on whether PCs for global, local, or both global and local ancestry were included as covariates.
The correlations between local and global PCs were low (r < 0.12), suggesting variability between local and global ancestry estimates. Genome-wide association tests without any ancestry adjustment demonstrated an inflated type I error rate that decreased with adjustment for local ancestry, global ancestry, or both. A known spurious association was replicated for SNPs within the lactase gene, and this false-positive association was abolished by adjustment with local or global ancestry PCs.
Population stratification is a potential source of bias in this seemingly homogeneous FHS population. However, local and global PCs derived from SNPs appear to provide adequate information about ancestry.
Detection of genomic DNA copy number variations (CNVs) can provide a more complete and comprehensive view of human disease. It is challenging to identify and represent relevant CNVs from genome-wide data because of the high data volume and the complexity of interactions.
In this paper, we incorporate DNA copy number variation data derived from SNP arrays into a computational shrunken model and formalize the detection of copy number variations as a case-control classification problem. More than 80% accuracy can be obtained using our classification model, and through shrinkage the number of CNVs relevant to disease can be determined. In order to understand the relevant CNVs, we study their corresponding SNPs in the genome; the statistical software PLINK is employed to compute the pair-wise SNP-SNP interactions and to identify SNP networks based on their P-values. Our selected SNP networks are statistically significant compared with random SNP networks and play a role in the biological process. For the unique genes in which those SNPs are located, a gene-gene similarity value is computed using GOSemSim, and gene pairs with similarity values greater than a threshold are selected to construct gene networks. A gene enrichment analysis shows that our gene networks are functionally important.
Experimental results demonstrate that our selected SNP and gene networks based on the selected CNVs contain some functional relationships directly or indirectly to disease study.
Two datasets are given to demonstrate the effectiveness of the introduced method. Statistical and biological analyses show that this shrunken classification model is effective in identifying CNVs from genome-wide data and that our proposed framework has the potential to become a useful analysis tool for SNP data sets.
One of the most challenging points in studying human common complex diseases is to search for both strong and weak susceptibility single-nucleotide polymorphisms (SNPs) and identify forms of genetic disease models. Currently, a number of methods have been proposed for this purpose. Many of them have not been validated through application to various genome datasets, so their abilities are not clear in real practice. In this paper, we present a novel SNP association study method based on probability theory, called ProbSNP. The method first detects SNPs by evaluating their joint probabilities in combination with disease status and selects those with the lowest joint probabilities as susceptibility SNPs, and then identifies some forms of genetic disease models through testing multiple-locus interactions among the selected SNPs. The joint probabilities of combined SNPs are estimated by establishing Gaussian distribution probability density functions, in which the related parameters (i.e., mean value and standard deviation) are evaluated based on allele and haplotype frequencies. Finally, we test and validate the method using various genome datasets. We find that ProbSNP has shown remarkable success in the applications to both simulated genome data and real genome-wide data.
Association study; SNPs; probability theory; Gaussian distribution; case-control
Genome wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show potential evidence of applying multigenic approaches in association studies of various diseases including lung cancer. But the question remains as to the best methodology to analyze single nucleotide polymorphisms in multiple genes. In this work, we consider four methods—logistic regression, logic regression, classification tree, and random forests—to compare results for identifying important genes or gene-gene and gene-environmental interactions. To evaluate the performance of four methods, the cross-validation misclassification error and areas under the curves are provided. We performed a simulation study and applied them to the data from a large-scale, population-based, case-control study.
SNP interactions; Logistic regression; Classification tree; Logic regression; Random Forests; Cross-validation error; Area under the Curve
Studies have shown that interactions among single nucleotide polymorphisms (SNPs) may play an important role in understanding the causes of complex disease. Machine learning approaches provide useful features to explore interactions more effectively and efficiently. We have proposed an integrated method that combines two machine learning methods - Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS) - to identify a subset of important SNPs and detect interaction patterns. In this two-stage RF-MARS (TRM) approach, RF is first applied to detect a predictive subset of SNPs, and then MARS is used to identify the interaction patterns among the selected SNPs. We evaluated the TRM performances in four models: three causal models with one two-way interaction and one null model. RF variable selection was based on out-of-bag classification error rate (OOB) and variable importance spectrum (IS). First, we compared the selection of important variables by RF and MARS. Our results support that RFOOB had better performance than MARS and RFIS in detecting important variables. We also evaluated the true positive rate and false positive rate of identifying interaction patterns in TRM and MARS. This study demonstrates that TRMOOB, which is RFOOB plus MARS, has combined the strengths of RF and MARS in identifying SNP-SNP interaction patterns in a scenario of 100 candidate SNPs. TRMOOB had a greater true positive rate and a lower false positive rate compared with MARS, particularly for searching interactions with a strong association with the outcome. Therefore the use of TRMOOB is favored for exploring SNP-SNP interactions in a large-scale genetic variation study.
polymorphism; interaction; machine learning
Genome-wide association studies offer an unbiased approach to identify new candidate genes for osteoporosis. We examined the Affymetrix 500K + 50K SNP GeneChip marker sets for associations with multiple osteoporosis-related traits at various skeletal sites, including bone mineral density (BMD, hip and spine), heel ultrasound, and hip geometric indices in the Framingham Osteoporosis Study. We evaluated 433,510 single-nucleotide polymorphisms (SNPs) in 2073 women (mean age 65 years), members of two-generational families. Variance components analysis was performed to estimate phenotypic, genetic, and environmental correlations (ρP, ρG, and ρE) among bone traits. Linear mixed-effects models were used to test associations between SNPs and multivariable-adjusted trait values. We evaluated the proportion of SNPs associated with pairs of the traits at a nominal significance threshold α = 0.01. We found substantial correlation between the proportion of associated SNPs and the ρP and ρG (r = 0.91 and 0.84, respectively) but much lower with ρE (r = 0.38). Thus, for example, hip and spine BMD had 6.8% associated SNPs in common, corresponding to ρP = 0.55 and ρG = 0.66 between them. Fewer SNPs were associated with both BMD and any of the hip geometric traits (eg, femoral neck and shaft width, section moduli, neck shaft angle, and neck length); ρG between BMD and geometric traits ranged from −0.24 to +0.40. In conclusion, we examined relationships between osteoporosis-related traits based on genome-wide associations. Most of the similarity between the quantitative bone phenotypes may be attributed to pleiotropic effects of genes. This knowledge may prove helpful in defining the best phenotypes to be used in genetic studies of osteoporosis. © 2010 American Society for Bone and Mineral Research.
bone mineral density; quantitative ultrasound; femoral geometry; genome-wide association; single-nucleotide polymorphisms; genetic correlations; pleiotropy
The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that is sufficient to infer all the other SNPs. Algorithms based on the r² LD statistic have gained popularity because r² is directly related to statistical power to detect disease associations. Most existing r²-based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experimental results show that FastTagger is several times faster than existing multi-marker-based tag SNP selection algorithms while consuming much less memory. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
FastTagger also produces smaller sets of tag SNPs than existing multi-marker-based algorithms, with a reduction ratio ranging from 3% to 9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when r² ≥ 0.9.
Generating multi-marker tagging rules is a computation-intensive task and is the bottleneck of existing multi-marker-based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.
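For orientation, the pairwise r² tagging that the abstract above contrasts with multi-marker rules can be sketched as follows. This is a generic greedy pairwise baseline, not FastTagger itself. It assumes phased haplotypes coded as 0/1 alleles, and the function names and the default threshold of 0.8 are choices made for this illustration.

```python
def r_squared(a, b):
    """Pairwise LD r^2 between two biallelic SNPs given 0/1 haplotype alleles."""
    n = len(a)
    pa = sum(a) / n                                  # allele-1 frequency at SNP a
    pb = sum(b) / n                                  # allele-1 frequency at SNP b
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0                                   # a monomorphic SNP carries no LD information
    d = pab - pa * pb                                # disequilibrium coefficient D
    return d * d / denom

def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedily pick tag SNPs until every SNP has r^2 >= threshold with some tag."""
    ids = list(haplotypes)
    covers = {
        s: {t for t in ids
            if s == t or r_squared(haplotypes[s], haplotypes[t]) >= threshold}
        for s in ids
    }
    untagged = set(ids)
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs.
        best = max(ids, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags
```

A multi-marker rule would instead let a combination of two or three tag SNPs jointly predict a target SNP, which is where the extra reduction in tag count reported above comes from, at the cost of a much larger search space.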
A variety of diseases are caused by chromosomal abnormalities such as aneuploidies (having an abnormal number of chromosomes), microdeletions, microduplications, and uniparental disomy. High density single nucleotide polymorphism (SNP) microarrays provide information on chromosomal copy number changes, as well as genotype (heterozygosity and homozygosity). SNP array studies generate multiple types of data for each SNP site, some with more than 100,000 SNPs represented on each array. The identification of different classes of anomalies within SNP data has been challenging.
We have developed SNPscan, a web-accessible tool to analyze and visualize high density SNP data. It enables researchers (1) to visually and quantitatively assess the quality of user-generated SNP data relative to a benchmark data set derived from a control population, (2) to display SNP intensity and allelic call data in order to detect chromosomal copy number anomalies (duplications and deletions), (3) to display uniparental isodisomy based on loss of heterozygosity (LOH) across genomic regions, (4) to compare paired samples (e.g. tumor and normal), and (5) to generate a file type for viewing SNP data in the University of California, Santa Cruz (UCSC) Human Genome Browser. SNPscan accepts data exported from Affymetrix Copy Number Analysis Tool as its input. We validated SNPscan using data generated from patients with known deletions, duplications, and uniparental disomy. We also inspected previously generated SNP data from 90 apparently normal individuals from the Centre d'Étude du Polymorphisme Humain (CEPH) collection, and identified three cases of uniparental isodisomy, four females having an apparently mosaic X chromosome, two mislabelled SNP data sets, and one microdeletion on chromosome 2 with mosaicism from an apparently normal female. These previously unrecognized abnormalities were all detected using SNPscan. The microdeletion was independently confirmed by fluorescence in situ hybridization, and a region of homozygosity in a UPD case was confirmed by sequencing of genomic DNA.
SNPscan is useful for identifying chromosomal abnormalities based on SNP intensity (such as chromosomal copy number changes) and heterozygosity data (including regions of LOH and some cases of UPD). The program and source code are available at the SNPscan website.
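The idea of flagging loss of heterozygosity from allelic calls can be illustrated with a minimal sketch. The abstract does not describe SNPscan's actual procedure, so this is only an assumed simplification: calls are taken to be ordered by chromosomal position and coded "AA", "AB", or "BB", any other value (such as a no-call) breaks a run, and the function name and `min_snps` cutoff are hypothetical.

```python
def homozygous_runs(calls, min_snps=5):
    """Return (start, end) index pairs of runs of consecutive homozygous calls.

    Only runs spanning at least `min_snps` SNPs are reported; indices are
    inclusive and refer to positions in the ordered `calls` list.
    """
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call in ("AA", "BB"):
            if start is None:
                start = i                      # a new homozygous stretch begins
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))    # close a stretch that is long enough
            start = None
    if start is not None and len(calls) - start >= min_snps:
        runs.append((start, len(calls) - 1))   # stretch running to the end of the list
    return runs
```

On real arrays the cutoff would need to reflect marker density and expected heterozygosity, and long homozygous stretches would still have to be distinguished from deletions using the intensity data.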
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration do not differ significantly from the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
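The filtration step described above (a Bonferroni threshold followed by LD pruning) can be sketched as below. This is a simplified illustration, not the authors' pipeline: LD is approximated by the squared Pearson correlation of 0/1/2 genotype codes, pruning keeps the SNP with the smaller p-value from each correlated pair, and the function names, SNP labels, and default cutoffs are assumptions for this example. The comparison of error rates across thresholds is not shown.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                       # no variation, so no usable correlation
    return sxy * sxy / (sxx * syy)

def filter_and_prune(pvalues, genotypes, alpha=0.05, r2_cutoff=0.8):
    """Keep SNPs passing a Bonferroni threshold, then prune correlated SNPs."""
    threshold = alpha / len(pvalues)     # Bonferroni: alpha divided by number of tests
    passing = sorted((s for s in pvalues if pvalues[s] <= threshold),
                     key=pvalues.get)    # most significant first
    kept = []
    for s in passing:
        # Drop s if it is in high LD with any SNP that was already kept.
        if all(pearson_r2(genotypes[s], genotypes[k]) < r2_cutoff for k in kept):
            kept.append(s)
    return kept
```

The surviving SNPs would then form the input to the random forest variable selection. Repeating the call with different `alpha` values gives the family of candidate datasets whose error rates the abstract compares.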
The widespread use of high-throughput methods of single nucleotide polymorphism (SNP) genotyping has created a number of computational and statistical challenges. The problem of identifying SNP–SNP interactions in case–control studies has been studied extensively, and a number of new techniques have been developed. Little progress has been made, however, in the analysis of SNP–SNP interactions in relation to time-to-event data, such as patient survival time or time to cancer relapse. We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP–SNP interactions in the context of survival analysis. The proposed Survival MDR (Surv-MDR) method handles survival data by modifying MDR’s constructive induction algorithm to use the log-rank test. Surv-MDR replaces balanced accuracy with log-rank test statistics as the score to determine the best models. We simulated datasets with a survival outcome related to two loci in the absence of any marginal effects. We compared Surv-MDR with Cox regression for their ability to identify the true predictive loci in these simulated data. We also used this simulation to construct the empirical distribution of Surv-MDR’s testing score. We then applied Surv-MDR to genetic data from a population-based epidemiologic study to find prognostic markers of survival time following a bladder cancer diagnosis. We identified several two-locus SNP combinations that have strong associations with patients’ survival outcome. Surv-MDR is capable of detecting interaction models with weak main effects. These epistatic models tend to be dropped by traditional Cox regression approaches to evaluating interactions.
With improved efficiency in handling genome-wide datasets, Surv-MDR will play an important role in a research strategy that embraces the complexity of the genotype–phenotype mapping relationship, since epistatic interactions are an important component of the genetic basis of disease.
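The Surv-MDR scoring idea described above can be sketched for a single pair of SNPs. The abstract states only that the log-rank test replaces balanced accuracy. The rule used here for labelling a genotype cell as high risk (a positive signed log-rank statistic for that cell versus all other subjects) is an assumption of this sketch, as are the function names. Cross-validation and the search over SNP pairs are omitted.

```python
import math

def logrank_z(times, events, in_group):
    """Signed two-group log-rank statistic.

    Positive values mean the flagged group had more events than expected
    under the null hypothesis of equal hazards.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    o_minus_e = 0.0
    var = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                       # at risk, all subjects
        n1 = sum(1 for x, g in zip(times, in_group) if x >= t and g)
        d = sum(1 for x, e in zip(times, events) if x == t and e)  # events at time t
        d1 = sum(1 for x, e, g in zip(times, events, in_group)
                 if x == t and e and g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e / math.sqrt(var) if var > 0 else 0.0

def surv_mdr_score(times, events, snp_a, snp_b):
    """Pool two-locus genotype cells into high/low risk and score the pooling."""
    cells = sorted(set(zip(snp_a, snp_b)))
    high_risk = set()
    for cell in cells:
        member = [(a, b) == cell for a, b in zip(snp_a, snp_b)]
        if logrank_z(times, events, member) > 0:   # assumed labelling rule
            high_risk.add(cell)
    pooled = [(a, b) in high_risk for a, b in zip(snp_a, snp_b)]
    return abs(logrank_z(times, events, pooled)), high_risk
```

Collapsing the genotype cells into one binary risk variable is what reduces the dimensionality: the model's score is a single two-group log-rank statistic regardless of how many cells the SNP pair defines.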