Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets.
Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results.
Availability and implementation:
Supplementary data are available at Bioinformatics online.
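The entropy idea behind EMVC can be illustrated with a minimal sketch. It assumes plain Shannon entropy over the clusters that a gene set's members fall into; the abstract does not give the exact entropy measure, and the names `cluster_entropy` and `cluster_of`, along with the toy clustering, are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(gene_set, cluster_of):
    """Shannon entropy (bits) of how a gene set's members spread across clusters.

    Assumption: this generic entropy stands in for the EMVC measure,
    which is not specified in the abstract.
    """
    counts = Counter(cluster_of[g] for g in gene_set if g in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Hypothetical toy clustering: gene name -> cluster label.
cluster_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 2}

# Members that share one cluster give zero entropy.
concentrated = cluster_entropy(["g1", "g2", "g3"], cluster_of)

# Members scattered over three clusters give log2(3) bits.
spread = cluster_entropy(["g1", "g4", "g5"], cluster_of)
```

Under this reading, an annotation filter would prefer dropping genes whose removal lowers the entropy, repeated over several cluster sizes and bootstrap resamples as the abstract describes.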
Motivation: Epistasis, the presence of gene–gene interactions, has been hypothesized to be at the root of many common human diseases, but current genome-wide association studies largely ignore its role. Multifactor dimensionality reduction (MDR) is a powerful model-free method for detecting epistatic relationships between genes, but computational costs have made its application to genome-wide data difficult. Graphics processing units (GPUs), the hardware responsible for rendering computer games, are powerful parallel processors. Using GPUs to run MDR on a genome-wide dataset allows for statistically rigorous testing of epistasis.
Results: The implementation of MDR for GPUs (MDRGPU) includes core features of the widely used Java software package, MDR. This GPU implementation allows for large-scale analysis of epistasis at a dramatically lower cost than the standard CPU-based implementations. As a proof-of-concept, we applied this software to a genome-wide study of sporadic amyotrophic lateral sclerosis (ALS). We discovered a statistically significant two-SNP classifier and subsequently replicated the significance of these two SNPs in an independent study of ALS. MDRGPU makes the large-scale analysis of epistasis tractable and opens the door to statistically rigorous testing of interactions in genome-wide datasets.
Availability: MDRGPU is open source and available free of charge from http://www.sourceforge.net/projects/mdr.
Supplementary information: Supplementary data are available at Bioinformatics online.
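To make the computational cost concrete, the core MDR scoring step for a single SNP pair can be sketched as follows. This is an illustrative sketch of the general MDR idea, not code from MDRGPU or the Java MDR package: it omits cross-validation and permutation testing, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def mdr_balanced_accuracy(snp_a, snp_b, status):
    """Score one SNP pair with the basic MDR pooling step.

    Each two-locus genotype cell is labeled high risk when its
    case:control ratio exceeds the overall ratio; the labeling is then
    scored by balanced accuracy on the same data. Assumes at least one
    case (status 1) and one control (status 0).
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    cells = defaultdict(lambda: [0, 0])  # genotype pair -> [controls, cases]
    for a, b, y in zip(snp_a, snp_b, status):
        cells[(a, b)][y] += 1

    true_pos = true_neg = 0
    for controls, cases in cells.values():
        # cases/controls > n_cases/n_controls, cross-multiplied to avoid
        # dividing by zero; ties are treated as low risk in this sketch.
        if cases * n_controls > controls * n_cases:
            true_pos += cases
        else:
            true_neg += controls
    return (true_pos / n_cases + true_neg / n_controls) / 2


# Hypothetical toy data with a pure XOR-like interaction and no main effects.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]
score = mdr_balanced_accuracy(snp_a, snp_b, status)
```

A genome-wide scan repeats this scoring for every pair of SNPs, so the number of evaluations grows quadratically with the number of markers. Each pair is scored independently of the others, which is what makes the problem a natural fit for parallel hardware.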
Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.
As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. As a result, due to its system-level interpretability, pathway analysis has become a popular tool for gaining insights into the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways need to be addressed. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN.
The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validations.
interactions; epistasis; pathway analysis; synthetic feature random forest (SF-RF); statistical epistasis network (SEN)
Many colleges and universities across the globe now offer bachelor's, master's, and doctoral degrees, along with certificate programs in bioinformatics. While there is some consensus surrounding curricula competencies, programs vary greatly in their core foci, with some leaning heavily toward the biological sciences and others toward quantitative areas. This allows prospective students to choose a program that best fits their interests and career goals. In the digital age, most scientific fields are facing an enormous growth of data, and as a consequence, the goals and challenges of bioinformatics are rapidly changing; this requires that bioinformatics education also change. In this workshop, we seek to ascertain current trends in bioinformatics education by asking the question, “What are the core competencies all bioinformaticians should have at the end of their training, and how successful have programs been in placing students in desired careers?”
Whether your interests lie in scientific arenas, the corporate world, or in government, you have certainly heard the praises of big data: Big data will give you new insights, allow you to become more efficient, and/or will solve your problems. While big data has had some outstanding successes, many are now beginning to see that it is not the Silver Bullet that it has been touted to be. Here our main concern is the overall impact of big data; the current manifestation of big data is constructing a Maginot Line in science in the 21st century. Big data is no longer just “lots of data” as a phenomenon; the big data paradigm is putting the spirit of the Maginot Line into lots of data. Big data overall is disconnecting researchers and science challenges. We propose No-Boundary Thinking (NBT), applying no-boundary thinking in problem defining to address science challenges.
Big data; Maginot Line; No-Boundary thinking
The non-linear interaction effect among multiple genetic factors, i.e. epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions with an order of two. We developed ViSEN to analyze and visualize both two-way and three-way epistatic interactions. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful to infer the underlying genetic architecture of complex diseases and to generate plausible hypotheses for further biological validations. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.
epistasis; gene-gene interaction; high-order interaction; networks; visualization; software; genome-wide association; complex diseases
In gene regulatory circuits, the expression of individual genes is commonly modulated by a set of regulating gene products, which bind to a gene’s cis-regulatory region. This region encodes an input-output function, referred to as signal-integration logic, that maps a specific combination of regulatory signals (inputs) to a particular expression state (output) of a gene. The space of all possible signal-integration functions is vast and the mapping from input to output is many-to-one: for the same set of inputs, many functions (genotypes) yield the same expression output (phenotype). Here, we exhaustively enumerate the set of signal-integration functions that yield identical gene expression patterns within a computational model of gene regulatory circuits. Our goal is to characterize the relationship between robustness and evolvability in the signal-integration space of regulatory circuits, and to understand how these properties vary between the genotypic and phenotypic scales. Among other results, we find that the distributions of genotypic robustness are skewed, such that the majority of signal-integration functions are robust to perturbation. We show that the connected set of genotypes that make up a given phenotype are constrained to specific regions of the space of all possible signal-integration functions, but that as the distance between genotypes increases, so does their capacity for unique innovations. In addition, we find that robust phenotypes are (i) evolvable, (ii) easily identified by random mutation, and (iii) mutationally biased toward other robust phenotypes. We explore the implications of these latter observations for mutation-based evolution by conducting random walks between randomly chosen source and target phenotypes. We demonstrate that the time required to identify the target phenotype is independent of the properties of the source phenotype.
Evolutionary Innovation; Random Boolean Circuit; Genetic Regulation; Genotype Networks; Genotype-Phenotype Map
Permutation-based statistics for evaluating the significance of class prediction, predictive attributes, and patterns of association have only appeared within the learning classifier system (LCS) literature since 2012. While still not widely utilized by the LCS research community, formal evaluations of test statistic confidence are imperative to large and complex real-world applications such as genetic epidemiology, where it is standard practice to quantify the likelihood that a seemingly meaningful statistic could have been obtained purely by chance. LCS algorithms are relatively computationally expensive on their own. The compounding requirements for generating permutation-based statistics may be a limiting factor for some researchers interested in applying LCS algorithms to real-world problems. Technology has made LCS parallelization strategies more accessible and thus more popular in recent years. In the present study we examine the benefits of externally parallelizing a series of independent LCS runs such that permutation testing with cross validation becomes more feasible to complete on a single multi-core workstation. We test our Python implementation of this strategy in the context of a simulated complex genetic epidemiological data mining problem. Our evaluations indicate that as long as the number of concurrent processes does not exceed the number of CPU cores, the speedup achieved is approximately linear.
Algorithms; Performance; Design; LCS; significance testing; parallelization; scalability; multi-core processors
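The external parallelization strategy described above can be sketched in outline. This is not the authors' implementation: a fixed set of predictions stands in for a full LCS training run, and the function names, the `map_fn` parameter, and the toy data are hypothetical. The point it illustrates is that each permutation is an independent job, so the jobs can be handed to any map-style executor.

```python
import random
from functools import partial


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def permuted_accuracy(seed, labels, predictions):
    """One independent job: shuffle the labels with its own seed and rescore.

    Assumption: in a real study this job would retrain and evaluate the
    model on the permuted dataset rather than reuse fixed predictions.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return accuracy(shuffled, predictions)


def permutation_p_value(labels, predictions, n_perm=1000, map_fn=map):
    """Empirical p-value for the observed accuracy.

    map_fn is the built-in map by default; a process pool's map method
    can be passed instead to spread the jobs over CPU cores.
    """
    observed = accuracy(labels, predictions)
    job = partial(permuted_accuracy, labels=labels, predictions=predictions)
    null_scores = list(map_fn(job, range(n_perm)))
    at_least_as_good = sum(score >= observed for score in null_scores)
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_perm + 1)


# Hypothetical toy data: a predictor that matches every label.
labels = [0, 1] * 10
predictions = list(labels)
p = permutation_p_value(labels, predictions, n_perm=200)
```

To run the jobs concurrently, one option is to create a `multiprocessing.Pool` inside an `if __name__ == "__main__":` guard and pass its `map` method as `map_fn`. Because the jobs share no state, this fits the abstract's observation that speedup stays roughly linear while the number of processes does not exceed the number of cores.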
Indoor and outdoor air pollution is known to contribute to increased lung cancer incidence. This study is the first to address the contribution of home heating fuel and geographical coarse particulate matter (PM10) concentrations to lung cancer rates in New Hampshire, U.S. First, Pearson correlation analysis and geographically weighted regression were used to investigate spatial relationships between outdoor PM10 and lung cancer rates. While the aforementioned analyses did not indicate a significant contribution of PM10 to lung cancer in the state, there was a trend towards a significant association in the northern and southwestern regions of the state. Second, case-control data were used to estimate the contributions of indoor pollution and second hand smoke to risk of lung cancer with adjustment for confounders. Increased risk was found among those who used wood or coal to heat their homes for more than 10 winters before the age of 18, with a significant increase in risk per winter. Resulting data suggest that further investigation of the relationship between heating-related air pollution levels and lung cancer risk is needed.
Genetic contributions to major depressive disorder (MDD) are thought to result from multiple genes interacting with each other. Different procedures have been proposed to detect such interactions. Which approach is best for explaining the risk of developing disease is unclear.
This study sought to elucidate the genetic interaction landscape in candidate genes for MDD by conducting a SNP-SNP interaction analysis using an exhaustive search through 3,704 SNP-markers in 1,732 cases and 1,783 controls provided by the GAIN MDD study. We used three different methods to detect interactions: two logistic regression models (multiplicative and additive) and one data mining and machine learning (MDR) approach.
Although none of the interactions survived correction for multiple comparisons, the results provide important information for future genetic interaction studies in complex disorders. Among the 0.5% most significant observations, none had been reported previously for risk to MDD. Within this group of interactions, less than 0.03% would have been detectable based on a main effect approach or an a priori algorithm. We evaluated correlations among the three different models and conclude that the three algorithms detected the same interactions only to a low degree. Although the top interactions had a surprisingly large effect size for MDD (e.g. additive dominant model Puncorrected = 9.10E-9 with attributable proportion (AP) value = 0.58 and multiplicative recessive model with Puncorrected = 6.95E-5 with odds ratio (OR estimated from β3) value = 4.99), the area under the curve (AUC) estimates were low (< 0.54). Moreover, the population attributable fraction (PAF) estimates were also low (< 0.15).
We conclude that the top interactions on their own did not explain much of the genetic variance of MDD. The different statistical interaction methods we used in the present study did not identify the same pairs of interacting markers. Genetic interaction studies may uncover previously unsuspected effects that could provide novel insights into MDD risk, but much larger sample sizes are needed before this strategy can be powerfully applied.
Additive interaction; Multiplicative interaction; Logistic regression; Data mining and machine learning; Major depressive disorder
Alzheimer’s disease is the most common form of progressive dementia and there is currently no known cure. The cause of onset is not fully understood but genetic factors are expected to play a significant role. We present here a bioinformatics approach to the genetic analysis of grey matter density as an endophenotype for late onset Alzheimer’s disease. Our approach combines machine learning analysis of gene-gene interactions with large-scale functional genomics data for assessing biological relationships.
We found a statistically significant synergistic interaction between two SNPs located in the intergenic region of an olfactory gene cluster. This model did not replicate in an independent dataset. However, genes in this region have high-confidence biological relationships and are consistent with previous findings implicating sensory processes in Alzheimer’s disease.
Previous genetic studies of Alzheimer’s disease have revealed only a small portion of the overall variability due to DNA sequence differences. Some of this missing heritability is likely due to complex gene-gene and gene-environment interactions. We have introduced here a novel bioinformatics analysis pipeline that embraces the complexity of the genetic architecture of Alzheimer’s disease while at the same time harnessing the power of functional genomics. These findings represent novel hypotheses about the genetic basis of this complex disease and provide open-access methods that others can use in their own studies.
Many developmental, physiological, and behavioral processes depend on the precise expression of genes in space and time. Such spatiotemporal gene expression phenotypes arise from the binding of sequence-specific transcription factors (TFs) to DNA, and from the regulation of nearby genes that such binding causes. These nearby genes may themselves encode TFs, giving rise to a transcription factor network (TFN), wherein nodes represent TFs and directed edges denote regulatory interactions between TFs. Computational studies have linked several topological properties of TFNs — such as their degree distribution — with the robustness of a TFN's gene expression phenotype to genetic and environmental perturbation. Another important topological property is assortativity, which measures the tendency of nodes with similar numbers of edges to connect. In directed networks, assortativity comprises four distinct components that collectively form an assortativity signature. We know very little about how a TFN's assortativity signature affects the robustness of its gene expression phenotype to perturbation. While recent theoretical results suggest that increasing one specific component of a TFN's assortativity signature leads to increased phenotypic robustness, the biological context of this finding is currently limited because the assortativity signatures of real-world TFNs have not been characterized. It is therefore unclear whether these earlier theoretical findings are biologically relevant. Moreover, it is not known how the other three components of the assortativity signature contribute to the phenotypic robustness of TFNs. Here, we use publicly available DNaseI-seq data to measure the assortativity signatures of genome-wide TFNs in 41 distinct human cell and tissue types. We find that all TFNs share a common assortativity signature and that this signature confers phenotypic robustness to model TFNs. 
Lastly, we determine the extent to which each of the four components of the assortativity signature contributes to this robustness.
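The four-component assortativity signature mentioned above can be sketched for a small directed network. The abstract does not spell out the formula, so this sketch assumes one common definition: for each pairing of source-node degree type and target-node degree type, take the Pearson correlation of those two degrees across all edges. The function names and the toy network are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def assortativity_signature(edges):
    """Map (source degree type, target degree type) -> edge-wise correlation.

    edges is a list of (regulator, target) pairs; degree types are
    "in" and "out", giving four components in total.
    """
    in_deg, out_deg = {}, {}
    for source, target in edges:
        out_deg[source] = out_deg.get(source, 0) + 1
        in_deg[target] = in_deg.get(target, 0) + 1
    degree = {"in": in_deg, "out": out_deg}

    signature = {}
    for src_type in ("in", "out"):
        for tgt_type in ("in", "out"):
            xs = [degree[src_type].get(s, 0) for s, _ in edges]
            ys = [degree[tgt_type].get(t, 0) for _, t in edges]
            signature[(src_type, tgt_type)] = pearson(xs, ys)
    return signature


# Hypothetical toy network of three transcription factors.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
signature = assortativity_signature(edges)
```

In this toy network the (in, out) component is 1, because the only regulator with two incoming edges points at the only target with two outgoing edges, while the (out, in) component is 0.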
The cells of living organisms do not concurrently express their entire complement of genes. Instead, they regulate their gene expression, and one consequence of this is the potential for different cells to adopt different stable gene expression patterns. For example, the development of an embryo necessitates that cells alter their gene expression patterns in order to differentiate. These gene expression phenotypes are largely robust to genetic mutation, and one source of this robustness may reside in the network structure of interacting molecules that underlie genetic regulation. Theoretical studies of regulatory networks have linked network structure to robustness; however, it is also necessary to more extensively characterize real-world regulatory networks in order to understand which structural properties may be biologically meaningful. We recently used theoretical models to show that a particular structural property, degree assortativity, is linked to robustness. Here, we measure the assortativity of human regulatory networks in 41 distinct cell and tissue types. We then develop a theoretical framework to explore how this structural property affects robustness, and we find that the gene expression phenotypes of human regulatory networks are more robust than expected by chance alone.
Gene regulatory networks (GRNs) represent the interactions between genes and gene products, which drive the gene expression patterns that produce cellular phenotypes. GRNs display a number of characteristics that are beneficial for the development and evolution of organisms. For example, they are often robust to genetic perturbation, such as mutations in regulatory regions or loss of gene function. Simultaneously, GRNs are often evolvable as these genetic perturbations are occasionally exploited to innovate novel regulatory programs. Several topological properties, such as degree distribution, are known to influence the robustness and evolvability of GRNs. Assortativity, which measures the propensity of nodes of similar connectivity to connect to one another, is a separate topological property that has recently been shown to influence the robustness of GRNs to point mutations in cis-regulatory regions. However, it remains to be seen how assortativity may influence the robustness and evolvability of GRNs to other forms of genetic perturbation, such as gene birth via duplication or de novo origination. Here, we employ a computational model of genetic regulation to investigate whether the assortativity of a GRN influences its robustness and evolvability upon gene birth. We find that the robustness of a GRN generally increases with increasing assortativity, while its evolvability generally decreases. However, the rate of change in robustness outpaces that of evolvability, resulting in an increased proportion of assortative GRNs that are simultaneously robust and evolvable. By providing a mechanistic explanation for these observations, this work extends our understanding of how the assortativity of a GRN influences its robustness and evolvability upon gene birth.
Boolean networks; out-components; genetic regulation
In omic research, such as genome-wide association studies, researchers seek to repeat their results in other datasets to reduce false positive findings and thus provide evidence for the existence of true associations. Unfortunately, this standard validation approach cannot completely eliminate false positive conclusions, and it can also mask many true associations that might otherwise advance our understanding of pathology. These issues raise the question: How can we increase the amount of knowledge gained from high throughput genetic data? To address this challenge, we present an approach that complements standard statistical validation methods by drawing attention to both potential false negative and false positive conclusions, as well as providing broad information for directing future research. The Diverse Convergent Evidence approach (DiCE) we propose integrates information from multiple sources (omics, informatics, and laboratory experiments) to estimate the strength of the available corroborating evidence supporting a given association. This process is designed to yield an evidence metric that has utility when etiologic heterogeneity, variable risk factor frequencies, and a variety of observational data imperfections might lead to false conclusions. We provide proof of principle examples in which DiCE identified strong evidence for associations that have established biological importance, when standard validation methods alone did not provide support. If used as an adjunct to standard validation methods this approach can leverage multiple distinct data types to improve genetic risk factor discovery/validation, promote effective science communication, and guide future research directions.
Replication; Validation; Complex disease; Heterogeneity; GWAS; Omics; Type 2 error; Type 1 error; False negatives; False positives
Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies.
In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis.
Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis.
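One way to read "separating all lower order effects" is the decomposition sketched below. It assumes that the pure three-way gain is the joint mutual information of the three attributes with the class, minus the three pairwise gains and the three main effects. The abstract does not give the formula, so this decomposition, the function names, and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return sum(-(c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_info(attrs, cls):
    """I(attrs; cls), treating the attribute columns as one joint variable."""
    return entropy(*attrs) + entropy(cls) - entropy(*attrs, cls)


def pairwise_gain(a, b, cls):
    """Two-way gain: joint information minus the two main effects."""
    return mutual_info([a, b], cls) - mutual_info([a], cls) - mutual_info([b], cls)


def three_way_gain(a, b, c, cls):
    """Pure three-way gain after removing all main and pairwise effects."""
    lower_order = (
        pairwise_gain(a, b, cls)
        + pairwise_gain(a, c, cls)
        + pairwise_gain(b, c, cls)
        + mutual_info([a], cls)
        + mutual_info([b], cls)
        + mutual_info([c], cls)
    )
    return mutual_info([a, b, c], cls) - lower_order


# Hypothetical toy data: the class is the parity of three binary attributes,
# so no single attribute or pair carries any information about it.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
cls = [x ^ y ^ z for x, y, z in zip(a, b, c)]
gain = three_way_gain(a, b, c, cls)
```

For the parity example every main effect and pairwise gain is zero, so the full one bit of class information is attributed to the three-way term.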
Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations.
Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
epistasis; information gain; gene-gene interaction; high-order interaction; genetic association studies
The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.
In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by “shape”. In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.
These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency, (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
Epistasis; Models; Simulation; Genetics; GAMETES; Computational geometry; Convex hull
Molecularly targeted drugs promise a safer and more effective treatment modality than conventional chemotherapy for cancer patients. However, tumors are dynamic systems that readily adapt to these agents activating alternative survival pathways as they evolve resistant phenotypes. Combination therapies can overcome resistance but finding the optimal combinations efficiently presents a formidable challenge. Here we introduce a new paradigm for the design of combination therapy treatment strategies that exploits the tumor adaptive process to identify context-dependent essential genes as druggable targets.
We have developed a framework to mine high-throughput transcriptomic data, based on differential coexpression and Pareto optimization, to investigate drug-induced tumor adaptation. We use this approach to identify tumor-essential genes as druggable candidates. We apply our method to a set of ER+ breast tumor samples, collected before (n = 58) and after (n = 60) neoadjuvant treatment with the aromatase inhibitor letrozole, to prioritize genes as targets for combination therapy with letrozole treatment. We validate letrozole-induced tumor adaptation through coexpression and pathway analyses in an independent data set (n = 18).
We find pervasive differential coexpression between the untreated and letrozole-treated tumor samples as evidence of letrozole-induced tumor adaptation. Based on patterns of coexpression, we identify ten genes as potential candidates for combination therapy with letrozole including EPCAM, a letrozole-induced essential gene and a target to which drugs have already been developed as cancer therapeutics. Through replication, we validate six letrozole-induced coexpression relationships and confirm the epithelial-to-mesenchymal transition as a process that is upregulated in the residual tumor samples following letrozole treatment.
To derive the greatest benefit from molecularly targeted drugs it is critical to design combination treatment strategies rationally. Incorporating knowledge of the tumor adaptation process into the design provides an opportunity to match targeted drugs to the evolving tumor phenotype and surmount resistance.
Several different genetic and environmental factors have been identified as independent risk factors for bladder cancer in population-based studies. Recent studies have turned to understanding the role of gene-gene and gene-environment interactions in determining risk. We previously developed the bioinformatics framework of statistical epistasis networks (SEN) to characterize the global structure of interacting genetic factors associated with a particular disease or clinical outcome. By applying SEN to a population-based study of bladder cancer among Caucasians in New Hampshire, we were able to identify a set of connected genetic factors with strong and significant interaction effects on bladder cancer susceptibility.
To support our statistical findings using networks, in the present study, we performed pathway enrichment analyses on the set of genes identified using SEN, and found that they are associated with the carcinogen benzo[a]pyrene, a component of tobacco smoke. We further carried out an mRNA expression microarray experiment to validate statistical genetic interactions, and to determine if the set of genes identified in the SEN were differentially expressed in a normal bladder cell line and a bladder cancer cell line in the presence or absence of benzo[a]pyrene. Significant nonrandom sets of genes from the SEN were found to be differentially expressed in response to benzo[a]pyrene in both the normal bladder cells and the bladder cancer cells. In addition, the patterns of gene expression were significantly different between these two cell types.
The enrichment analyses and the gene expression microarray results support the idea that SEN analysis of bladder cancer in population-based studies is able to identify biologically meaningful statistical patterns. These results bring us a step closer to a systems genetic approach to understanding cancer susceptibility that integrates population and laboratory-based studies.
Epistasis; Gene-gene interactions; Statistical epistasis networks; Benzo[a]pyrene; Gene-drug association; Bladder cancer
Low vitamin D status has been shown to be a risk factor for several metabolic traits such as obesity, diabetes and cardiovascular disease. The biological actions of 1,25-dihydroxyvitamin D are mediated through the vitamin D receptor (VDR), which heterodimerizes with retinoid X receptor, gamma (RXRG). Hence, we examined the potential interactions between the tagging polymorphisms in the VDR (22 tag SNPs) and RXRG (23 tag SNPs) genes on metabolic outcomes such as body mass index, waist circumference, waist-hip ratio (WHR), high- and low-density lipoprotein (LDL) cholesterols, serum triglycerides, systolic and diastolic blood pressures and glycated haemoglobin in the 1958 British Birth Cohort (1958BC, up to n = 5,231). We used the multifactor dimensionality reduction (MDR) program as a non-parametric test for potential interactions between the VDR and RXRG gene polymorphisms in the 1958BC. We used the data from the Northern Finland Birth Cohort 1966 (NFBC66, up to n = 5,316) and Twins UK (up to n = 3,943) to replicate our initial findings from the 1958BC.
After Bonferroni correction, the joint-likelihood ratio test suggested interactions on serum triglycerides (4 SNP-SNP pairs), LDL cholesterol (2 SNP-SNP pairs) and WHR (1 SNP-SNP pair) in the 1958BC. MDR permutation model testing analysis showed one two-way and one three-way interaction to be statistically significant on serum triglycerides in the 1958BC. In a meta-analysis of results from the two replication cohorts (NFBC66 and Twins UK, total n = 8,183), none of the interactions remained after correction for multiple testing (P for interaction > 0.17).
Our results did not provide strong evidence for interactions between allelic variations in VDR and RXRG genes on metabolic outcomes; however, further replication studies on large samples are needed to confirm our findings.
VDR; RXRG; SNPs; SNP-SNP interaction; 1958BC
Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios.
We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented.
The models we propose make no assumptions about the data structure, and capture the patterns in the data by specifying only the predictors involved and not any particular model structure. They therefore do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, shares the properties of the statistical machine from which it is derived.
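The counterfactual reading of an effect size can be illustrated with a minimal sketch. It is not the authors' implementation: a simple per-pattern event fraction stands in for the random forest probability machine, the effect size is reported as an average risk difference, and the function names and the toy data are hypothetical. The idea shown is only that, once conditional probabilities are estimated without a model structure, an effect size follows from comparing predicted risk with a predictor set to 1 versus 0 while the other predictors keep their observed values.

```python
from collections import defaultdict


def fit_cell_probabilities(X, y):
    """Estimate P(y = 1 | x) as the observed event fraction for each predictor pattern.

    Assumption: this cell-frequency estimator is a stand-in for a random forest
    probability machine; it only works for a few discrete predictors.
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [events, total]
    for x, yi in zip(X, y):
        counts[tuple(x)][0] += yi
        counts[tuple(x)][1] += 1
    return {pattern: events / total for pattern, (events, total) in counts.items()}


def counterfactual_risk_difference(prob, X, j):
    """Average change in predicted risk when binary predictor j is set to 1 versus 0."""
    diffs = []
    for x in X:
        x1 = tuple(1 if k == j else v for k, v in enumerate(x))
        x0 = tuple(0 if k == j else v for k, v in enumerate(x))
        if x1 in prob and x0 in prob:  # skip patterns that were never observed
            diffs.append(prob[x1] - prob[x0])
    if not diffs:
        raise ValueError("no observed pattern pairs for this predictor")
    return sum(diffs) / len(diffs)


def make_cell(pattern, events, total):
    """Hypothetical helper: 'total' subjects sharing a pattern, 'events' of them with y = 1."""
    return [(pattern, 1)] * events + [(pattern, 0)] * (total - events)


# Toy data with an interaction: predictor 0 raises risk much more when predictor 1 is present.
rows = (make_cell((0, 0), 2, 10) + make_cell((1, 0), 4, 10)
        + make_cell((0, 1), 2, 10) + make_cell((1, 1), 8, 10))
X = [x for x, _ in rows]
y = [yi for _, yi in rows]

prob = fit_cell_probabilities(X, y)
effect_x0 = counterfactual_risk_difference(prob, X, 0)
effect_x1 = counterfactual_risk_difference(prob, X, 1)
print(effect_x0, effect_x1)
```

In this toy example the risk difference for the first predictor differs between the two levels of the second predictor (0.2 versus 0.6), which is the kind of pattern an interaction scan would look for.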
Consistent nonparametric regression; Logistic regression; Probability machine; Odds ratio; Counterfactuals; Interactions
A central goal of human genetics is to identify and characterize susceptibility genes for common complex human diseases. An important challenge in this endeavor is the modeling of gene-gene interaction or epistasis that can result in non-additivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modeling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher’s Exact Test rather than a predetermined threshold. The advantage of this approach is that only those genotype combinations that are determined to be statistically significant are considered in the MDR analysis. We use two simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
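The difference between threshold-based and test-based constructive induction can be sketched as follows. This is an illustrative reading of the description above, not the published RMDR code. Several details are assumptions: each multilocus genotype combination is tested against all other combinations pooled in a 2x2 table, the test is two-sided, the significance level is 0.05, and non-significant combinations are labeled "unknown". The function names and the toy counts are hypothetical.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def label_cells(genotypes, status, alpha=None):
    """Label each genotype combination as 'high', 'low', or 'unknown' risk.

    alpha=None gives threshold-based labeling (cell case:control ratio versus
    the overall ratio). With alpha set, a cell is labeled only if Fisher's
    exact test against all other cells is significant (assumed test layout).
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    labels = {}
    for g in set(cases) | set(controls):
        a, b = cases[g], controls[g]
        if alpha is not None:
            p = fisher_exact_two_sided(a, b, n_cases - a, n_controls - b)
            if p >= alpha:
                labels[g] = "unknown"  # left out of the reduced attribute
                continue
        # Cross-multiplied comparison of a/b with n_cases/n_controls.
        labels[g] = "high" if a * n_controls > b * n_cases else "low"
    return labels


# Toy two-SNP data: (genotype combination, number of cases, number of controls).
toy_counts = [((0, 0), 20, 5), ((1, 1), 5, 20), ((0, 1), 6, 5), ((1, 0), 5, 6)]
genotypes, status = [], []
for g, n_case, n_control in toy_counts:
    genotypes += [g] * (n_case + n_control)
    status += [1] * n_case + [0] * n_control

classic = label_cells(genotypes, status)
robust = label_cells(genotypes, status, alpha=0.05)
print(classic)
print(robust)
```

With these toy counts, the threshold rule assigns a risk label to the two nearly balanced combinations on the basis of a single extra case or control, while the test-based rule sets them aside and keeps only the two combinations with a clear case-control imbalance.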
Complex interactions among genes and environmental factors are known to play a role in common human disease aetiology. There is a growing body of evidence to suggest that complex interactions are ‘the norm’ and, rather than amounting to a small perturbation to classical Mendelian genetics, interactions may be the predominant effect. Traditional statistical methods are not well suited for detecting such interactions, especially when the data are high dimensional (many attributes or independent variables) or when interactions occur between more than two polymorphisms. In this review, we discuss machine-learning models and algorithms for identifying and characterising susceptibility genes in common, complex, multifactorial human diseases. We focus on the following machine-learning methods that have been used to detect gene-gene interactions: neural networks, cellular automata, random forests, and multifactor dimensionality reduction. We conclude with some ideas about how these methods and others can be integrated into a comprehensive and flexible framework for data mining and knowledge discovery in human genetics.