Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets.
Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results.
Availability and implementation:
Supplementary information: Supplementary data are available at Bioinformatics online.
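The entropy idea behind EMVC can be illustrated with a small sketch. The abstract does not give the exact entropy measure or filtering rule, so everything below is an assumption made for illustration: `cluster_entropy` scores how evenly a gene set's members spread across disjoint clusters, and `filter_to_dominant_cluster` is a hypothetical filtering rule that keeps only the genes in the most populated cluster. The real method also aggregates over a range of cluster sizes and bootstrap resampled datasets, which this sketch omits.

```python
import math
from collections import Counter


def cluster_entropy(gene_set, cluster_of):
    """Shannon entropy (bits) of a gene set's spread across disjoint clusters."""
    counts = Counter(cluster_of[g] for g in gene_set if g in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_to_dominant_cluster(gene_set, cluster_of):
    """Hypothetical filter: keep only genes in the most populated cluster."""
    counts = Counter(cluster_of[g] for g in gene_set if g in cluster_of)
    if not counts:
        return set()
    dominant, _ = counts.most_common(1)[0]
    return {g for g in gene_set if cluster_of.get(g) == dominant}


# Toy data (made up): three annotated genes cluster together, one does not.
cluster_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
gene_set = {"g1", "g2", "g3", "g4"}
filtered = filter_to_dominant_cluster(gene_set, cluster_of)
print(cluster_entropy(gene_set, cluster_of), cluster_entropy(filtered, cluster_of))
```

Removing the annotation that falls outside the dominant cluster drives the entropy of the set to zero, which is the direction of optimization the abstract describes.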
Motivation: Epistasis, the presence of gene–gene interactions, has been hypothesized to be at the root of many common human diseases, but current genome-wide association studies largely ignore its role. Multifactor dimensionality reduction (MDR) is a powerful model-free method for detecting epistatic relationships between genes, but computational costs have made its application to genome-wide data difficult. Graphics processing units (GPUs), the hardware responsible for rendering computer games, are powerful parallel processors. Using GPUs to run MDR on a genome-wide dataset allows for statistically rigorous testing of epistasis.
Results: The implementation of MDR for GPUs (MDRGPU) includes core features of the widely used Java software package, MDR. This GPU implementation allows for large-scale analysis of epistasis at a dramatically lower cost than the standard CPU-based implementations. As a proof-of-concept, we applied this software to a genome-wide study of sporadic amyotrophic lateral sclerosis (ALS). We discovered a statistically significant two-SNP classifier and subsequently replicated the significance of these two SNPs in an independent study of ALS. MDRGPU makes the large-scale analysis of epistasis tractable and opens the door to statistically rigorous testing of interactions in genome-wide datasets.
Availability: MDRGPU is open source and available free of charge from http://www.sourceforge.net/projects/mdr.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.
As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. Owing to its system-level interpretability, pathway analysis has therefore become a popular tool for gaining insights into the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways need to be addressed. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN.
The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validations.
interactions; epistasis; pathway analysis; synthetic feature random forest (SF-RF); statistical epistasis network (SEN)
Many colleges and universities across the globe now offer bachelor's, master's, and doctoral degrees, along with certificate programs in bioinformatics. While there is some consensus surrounding curricular competencies, programs vary greatly in their core foci, with some leaning heavily toward the biological sciences and others toward quantitative areas. This allows prospective students to choose a program that best fits their interests and career goals. In the digital age, most scientific fields are facing an enormous growth of data, and as a consequence, the goals and challenges of bioinformatics are rapidly changing; this requires that bioinformatics education also change. In this workshop, we seek to ascertain current trends in bioinformatics education by asking the question, “What are the core competencies all bioinformaticians should have at the end of their training, and how successful have programs been in placing students in desired careers?”
The non-linear interaction effect among multiple genetic factors, i.e. epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions of order two. We developed ViSEN to analyze and visualize both two-way and three-way epistatic interactions. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful for inferring the underlying genetic architecture of complex diseases and for generating plausible hypotheses for further biological validation. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.
epistasis; gene-gene interaction; high-order interaction; networks; visualization; software; genome-wide association; complex diseases
Indoor and outdoor air pollution is known to contribute to increased lung cancer incidence. This study is the first to address the contribution of home heating fuel and geographical coarse particulate matter (PM10) concentrations to lung cancer rates in New Hampshire, U.S. First, Pearson correlation analysis and geographically weighted regression were used to investigate spatial relationships between outdoor PM10 and lung cancer rates. While these analyses did not indicate a significant contribution of PM10 to lung cancer in the state, there was a trend towards a significant association in the northern and southwestern regions of the state. Second, case-control data were used to estimate the contributions of indoor pollution and secondhand smoke to risk of lung cancer with adjustment for confounders. Increased risk was found among those who used wood or coal to heat their homes for more than 10 winters before the age of 18, with a significant increase in risk per winter. The resulting data suggest that further investigation of the relationship between heating-related air pollution levels and lung cancer risk is needed.
Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies.
In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis.
Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis.
Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations.
Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
epistasis; information gain; gene-gene interaction; high-order interaction; genetic association studies
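An information-gain measure of this kind can be sketched as follows. The abstract does not state the formula, so the decomposition used here is an assumption: the three-way term is taken to be the mutual information between the joint genotype (A, B, C) and the phenotype, minus all pairwise information gains and all main effects. The function names are illustrative, not the authors' API.

```python
import math
from collections import Counter
from itertools import product


def entropy(rows, cols):
    """Shannon entropy (bits) of the joint distribution of the given columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mutual_info(rows, xs, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), where X may be a joint of several columns."""
    return entropy(rows, xs) + entropy(rows, [y]) - entropy(rows, xs + [y])


def ig2(rows, a, b, y):
    """Pairwise information gain: joint effect minus the two main effects."""
    return (mutual_info(rows, [a, b], y)
            - mutual_info(rows, [a], y)
            - mutual_info(rows, [b], y))


def ig3(rows, a, b, c, y):
    """Assumed three-way term: joint effect minus all lower-order effects."""
    joint = mutual_info(rows, [a, b, c], y)
    pairs = ig2(rows, a, b, y) + ig2(rows, a, c, y) + ig2(rows, b, c, y)
    mains = sum(mutual_info(rows, [v], y) for v in (a, b, c))
    return joint - pairs - mains


# Toy data: the phenotype is the XOR of three binary attributes, so no single
# attribute and no pair carries any information about it on its own.
rows = [(a, b, c, a ^ b ^ c) for a, b, c in product([0, 1], repeat=3)]
print(ig3(rows, 0, 1, 2, 3))
```

In this XOR example every main effect and pairwise gain is zero, so the full bit of information about the phenotype is attributed to the three-way term, which is the "pure" case the abstract refers to.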
Whether your interests lie in scientific arenas, the corporate world, or in government, you have certainly heard the praises of big data: Big data will give you new insights, allow you to become more efficient, and/or will solve your problems. While big data has had some outstanding successes, many are now beginning to see that it is not the Silver Bullet that it has been touted to be. Here our main concern is the overall impact of big data; the current manifestation of big data is constructing a Maginot Line in science in the 21st century. Big data is no longer just the phenomenon of “lots of data”; the big data paradigm is putting the spirit of the Maginot Line into lots of data. Big data overall is disconnecting researchers from science challenges. We propose No-Boundary Thinking (NBT), which applies no-boundary thinking in problem definition to address science challenges.
Big data; Maginot Line; No-Boundary thinking
A central goal of human genetics is to identify and characterize susceptibility genes for common complex human diseases. An important challenge in this endeavor is the modeling of gene-gene interaction or epistasis that can result in non-additivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modeling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher’s Exact Test rather than a predetermined threshold. The advantage of this approach is that only those genotype combinations that are determined to be statistically significant are considered in the MDR analysis. We use two simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
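A minimal sketch of constructive induction with a Fisher's exact test might look like the following. The abstract does not say how the 2x2 table is formed for each genotype combination, so this sketch assumes each combination is compared against all remaining samples; the function names, the `alpha` default, and the "unknown" label are illustrative assumptions rather than the published RMDR implementation.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def rmdr_labels(genotypes, status, alpha=0.05):
    """Label each genotype combination 'high', 'low', or 'unknown' risk."""
    cells = Counter(zip(genotypes, status))
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    labels = {}
    for g in set(genotypes):
        a, b = cells[(g, 1)], cells[(g, 0)]  # cases and controls in this cell
        p = fisher_exact_two_sided(a, b, n_cases - a, n_controls - b)
        if p >= alpha:
            labels[g] = "unknown"  # not significant: left out of the model
        elif a * n_controls > b * n_cases:
            labels[g] = "high"
        else:
            labels[g] = "low"
    return labels


# Toy two-SNP data (made up): status 1 = case, 0 = control.
genotypes = [(0, 0)] * 10 + [(1, 1)] * 10 + [(0, 1)] * 10
status = [1] * 10 + [0] * 10 + [1] * 5 + [0] * 5
labels = rmdr_labels(genotypes, status)
print(labels)
```

Standard MDR would force the balanced (0, 1) cell into a high- or low-risk group based on a fixed ratio threshold; here it is set aside because the test finds no evidence either way.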
In gene regulatory circuits, the expression of individual genes is commonly modulated by a set of regulating gene products, which bind to a gene’s cis-regulatory region. This region encodes an input-output function, referred to as signal-integration logic, that maps a specific combination of regulatory signals (inputs) to a particular expression state (output) of a gene. The space of all possible signal-integration functions is vast and the mapping from input to output is many-to-one: for the same set of inputs, many functions (genotypes) yield the same expression output (phenotype). Here, we exhaustively enumerate the set of signal-integration functions that yield identical gene expression patterns within a computational model of gene regulatory circuits. Our goal is to characterize the relationship between robustness and evolvability in the signal-integration space of regulatory circuits, and to understand how these properties vary between the genotypic and phenotypic scales. Among other results, we find that the distributions of genotypic robustness are skewed, such that the majority of signal-integration functions are robust to perturbation. We show that the connected set of genotypes that make up a given phenotype are constrained to specific regions of the space of all possible signal-integration functions, but that as the distance between genotypes increases, so does their capacity for unique innovations. In addition, we find that robust phenotypes are (i) evolvable, (ii) easily identified by random mutation, and (iii) mutationally biased toward other robust phenotypes. We explore the implications of these latter observations for mutation-based evolution by conducting random walks between randomly chosen source and target phenotypes. We demonstrate that the time required to identify the target phenotype is independent of the properties of the source phenotype.
Evolutionary Innovation; Random Boolean Circuit; Genetic Regulation; Genotype Networks; Genotype-Phenotype Map
Complex interactions among genes and environmental factors are known to play a role in common human disease aetiology. There is a growing body of evidence to suggest that complex interactions are ‘the norm’ and, rather than amounting to a small perturbation to classical Mendelian genetics, interactions may be the predominant effect. Traditional statistical methods are not well suited for detecting such interactions, especially when the data are high dimensional (many attributes or independent variables) or when interactions occur between more than two polymorphisms. In this review, we discuss machine-learning models and algorithms for identifying and characterising susceptibility genes in common, complex, multifactorial human diseases. We focus on the following machine-learning methods that have been used to detect gene-gene interactions: neural networks, cellular automata, random forests, and multifactor dimensionality reduction. We conclude with some ideas about how these methods and others can be integrated into a comprehensive and flexible framework for data mining and knowledge discovery in human genetics.
Permutation-based statistics for evaluating the significance of class prediction, predictive attributes, and patterns of association have only appeared within the learning classifier system (LCS) literature since 2012. While still not widely utilized by the LCS research community, formal evaluations of test statistic confidence are imperative to large and complex real-world applications such as genetic epidemiology, where it is standard practice to quantify the likelihood that a seemingly meaningful statistic could have been obtained purely by chance. LCS algorithms are relatively computationally expensive on their own. The compounding requirements for generating permutation-based statistics may be a limiting factor for some researchers interested in applying LCS algorithms to real-world problems. Technology has made LCS parallelization strategies more accessible and thus more popular in recent years. In the present study we examine the benefits of externally parallelizing a series of independent LCS runs such that permutation testing with cross validation becomes more feasible to complete on a single multi-core workstation. We test our Python implementation of this strategy in the context of a simulated complex genetic epidemiological data mining problem. Our evaluations indicate that as long as the number of concurrent processes does not exceed the number of CPU cores, the speedup achieved is approximately linear.
Algorithms; Performance; Design; LCS; significance testing; parallelization; scalability; multi-core processors
Epistasis or gene-gene interaction is a fundamental component of the genetic architecture of complex traits such as disease susceptibility. Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free method to detect epistasis when there are no significant marginal genetic effects. However, in many studies of complex disease, other covariates like age of onset and smoking status could have a strong main effect and may potentially interfere with MDR's ability to achieve its goal. In this paper, we present a simple and computationally efficient sampling method to adjust for covariate effects in MDR. We use simulation to show that after adjustment, MDR has sufficient power to detect true gene-gene interactions. We also compare our method with the state-of-the-art technique in covariate adjustment. The results suggest that our proposed method performs similarly, but is more computationally efficient. We then apply this new method to an analysis of a population-based bladder cancer study in New Hampshire.
Covariate adjustment; Multifactor dimensionality reduction; Epistasis
Pleiotropy, in which one mutation causes multiple phenotypes, has traditionally been seen as a deviation from the conventional observation in which one gene affects one phenotype. Epistasis, or gene-gene interaction, has also been treated as an exception to the Mendelian one gene-one phenotype paradigm. This simplified perspective belies the pervasive complexity of biology and hinders progress toward a deeper understanding of biological systems. We assert that epistasis and pleiotropy are not isolated occurrences, but ubiquitous and inherent properties of biomolecular networks. These phenomena should not be treated as exceptions, but rather as fundamental components of genetic analyses. A systems level understanding of epistasis and pleiotropy is, therefore, critical to furthering our understanding of human genetics and its contribution to common human disease. Finally, graph theory offers an intuitive and powerful set of tools with which to study the network bases of these important genetic phenomena.
epistasis; pleiotropy; gene-gene interactions; phenotype; canalization; scale-free network; complex disease; systems biology
Genetic contributions to major depressive disorder (MDD) are thought to result from multiple genes interacting with each other. Different procedures have been proposed to detect such interactions. Which approach is best for explaining the risk of developing disease is unclear.
This study sought to elucidate the genetic interaction landscape in candidate genes for MDD by conducting a SNP-SNP interaction analysis using an exhaustive search through 3,704 SNP markers in 1,732 cases and 1,783 controls provided by the GAIN MDD study. We used three different methods to detect interactions: two logistic regression models (multiplicative and additive) and one data mining and machine learning approach (MDR).
Although none of the interactions survived correction for multiple comparisons, the results provide important information for future genetic interaction studies in complex disorders. Among the 0.5% most significant observations, none had been reported previously for risk of MDD. Within this group of interactions, less than 0.03% would have been detectable based on a main-effect approach or an a priori algorithm. We evaluated correlations among the three different models and conclude that all three algorithms detected the same interactions only to a low degree. Although the top interactions had a surprisingly large effect size for MDD (e.g. additive dominant model with uncorrected P = 9.10E-9 and attributable proportion (AP) value = 0.58, and multiplicative recessive model with uncorrected P = 6.95E-5 and odds ratio (OR estimated from β3) value = 4.99), the area under the curve (AUC) estimates were low (< 0.54). Moreover, the population attributable fraction (PAF) estimates were also low (< 0.15).
We conclude that the top interactions on their own did not explain much of the genetic variance of MDD. The different statistical interaction methods we used in the present study did not identify the same pairs of interacting markers. Genetic interaction studies may uncover previously unsuspected effects that could provide novel insights into MDD risk, but much larger sample sizes are needed before this strategy can be powerfully applied.
Additive interaction; Multiplicative interaction; Logistic regression; Data mining and machine learning; Major depressive disorder
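For readers unfamiliar with the additive-scale measure reported above, the attributable proportion due to interaction (AP) is conventionally derived from the relative excess risk due to interaction (RERI). The sketch below uses those standard definitions with odds ratios standing in for relative risks; the abstract does not show how AP was computed in the study, and the numbers in the example are made up for illustration.

```python
def reri(or_11, or_10, or_01):
    """Relative excess risk due to interaction on the additive scale.

    or_11: odds ratio when both risk factors are present
    or_10, or_01: odds ratios when only one of the two factors is present
    (all relative to the doubly unexposed reference group)
    """
    return or_11 - or_10 - or_01 + 1.0


def attributable_proportion(or_11, or_10, or_01):
    """Share of the doubly exposed effect attributable to interaction."""
    return reri(or_11, or_10, or_01) / or_11


# Illustrative values only, not taken from the study.
print(attributable_proportion(4.0, 1.5, 1.5))  # 0.5
```

An AP of 0 corresponds to no departure from additivity, so values well above 0 indicate a synergistic interaction on the additive scale.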
Alzheimer’s disease is the most common form of progressive dementia and there is currently no known cure. The cause of onset is not fully understood but genetic factors are expected to play a significant role. We present here a bioinformatics approach to the genetic analysis of grey matter density as an endophenotype for late onset Alzheimer’s disease. Our approach combines machine learning analysis of gene-gene interactions with large-scale functional genomics data for assessing biological relationships.
We found a statistically significant synergistic interaction between two SNPs located in the intergenic region of an olfactory gene cluster. This model did not replicate in an independent dataset. However, genes in this region have high-confidence biological relationships and are consistent with previous findings implicating sensory processes in Alzheimer’s disease.
Previous genetic studies of Alzheimer’s disease have revealed only a small portion of the overall variability due to DNA sequence differences. Some of this missing heritability is likely due to complex gene-gene and gene-environment interactions. We have introduced here a novel bioinformatics analysis pipeline that embraces the complexity of the genetic architecture of Alzheimer’s disease while at the same time harnessing the power of functional genomics. These findings represent novel hypotheses about the genetic basis of this complex disease and provide open-access methods that others can use in their own studies.
Proteomics and the study of protein–protein interactions are becoming increasingly important in our effort to understand human diseases on a system-wide level. Thanks to the development and curation of protein-interaction databases, up-to-date information on these interaction networks is accessible and publicly available to the scientific community. As our knowledge of protein–protein interactions increases, it is important to give thought to the different ways that these resources can impact biomedical research. In this article, we highlight the importance of protein–protein interactions in human genetics and genetic epidemiology. Since protein–protein interactions demonstrate one of the strongest functional relationships between genes, combining genomic data with available proteomic data may provide us with a more in-depth understanding of common human diseases. In this review, we will discuss some of the fundamentals of protein interactions, the databases that are publicly available and how information from these databases can be used to facilitate genome-wide genetic studies.
epistasis; expert knowledge; multifactor dimensionality reduction; protein–protein interaction; single nucleotide polymorphism
Many developmental, physiological, and behavioral processes depend on the precise expression of genes in space and time. Such spatiotemporal gene expression phenotypes arise from the binding of sequence-specific transcription factors (TFs) to DNA, and from the regulation of nearby genes that such binding causes. These nearby genes may themselves encode TFs, giving rise to a transcription factor network (TFN), wherein nodes represent TFs and directed edges denote regulatory interactions between TFs. Computational studies have linked several topological properties of TFNs — such as their degree distribution — with the robustness of a TFN's gene expression phenotype to genetic and environmental perturbation. Another important topological property is assortativity, which measures the tendency of nodes with similar numbers of edges to connect. In directed networks, assortativity comprises four distinct components that collectively form an assortativity signature. We know very little about how a TFN's assortativity signature affects the robustness of its gene expression phenotype to perturbation. While recent theoretical results suggest that increasing one specific component of a TFN's assortativity signature leads to increased phenotypic robustness, the biological context of this finding is currently limited because the assortativity signatures of real-world TFNs have not been characterized. It is therefore unclear whether these earlier theoretical findings are biologically relevant. Moreover, it is not known how the other three components of the assortativity signature contribute to the phenotypic robustness of TFNs. Here, we use publicly available DNaseI-seq data to measure the assortativity signatures of genome-wide TFNs in 41 distinct human cell and tissue types. We find that all TFNs share a common assortativity signature and that this signature confers phenotypic robustness to model TFNs. 
Lastly, we determine the extent to which each of the four components of the assortativity signature contributes to this robustness.
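The four-component assortativity signature of a directed network can be sketched in a few lines. This example assumes the usual definition, in which each component is the Pearson correlation, taken over edges, between one degree type (in or out) of the source node and one degree type of the target node; the abstract does not spell out the exact estimator used, and the toy edge list is made up.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def assortativity_signature(edges):
    """Return {(source_degree_type, target_degree_type): correlation}."""
    degree = {
        "out": Counter(s for s, _ in edges),
        "in": Counter(t for _, t in edges),
    }
    signature = {}
    for src_type in ("in", "out"):
        for tgt_type in ("in", "out"):
            xs = [degree[src_type][s] for s, _ in edges]
            ys = [degree[tgt_type][t] for _, t in edges]
            signature[(src_type, tgt_type)] = pearson(xs, ys)
    return signature


# Toy regulatory network: each pair is (regulator, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")]
signature = assortativity_signature(edges)
for key, value in signature.items():
    print(key, round(value, 3))
```

A positive component means that edges tend to link nodes with similar values of the two degree types in question, and a negative component means the opposite.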
The cells of living organisms do not concurrently express their entire complement of genes. Instead, they regulate their gene expression, and one consequence of this is the potential for different cells to adopt different stable gene expression patterns. For example, the development of an embryo necessitates that cells alter their gene expression patterns in order to differentiate. These gene expression phenotypes are largely robust to genetic mutation, and one source of this robustness may reside in the network structure of interacting molecules that underlie genetic regulation. Theoretical studies of regulatory networks have linked network structure to robustness; however, it is also necessary to more extensively characterize real-world regulatory networks in order to understand which structural properties may be biologically meaningful. We recently used theoretical models to show that a particular structural property, degree assortativity, is linked to robustness. Here, we measure the assortativity of human regulatory networks in 41 distinct cell and tissue types. We then develop a theoretical framework to explore how this structural property affects robustness, and we find that the gene expression phenotypes of human regulatory networks are more robust than expected by chance alone.
Gene regulatory networks (GRNs) represent the interactions between genes and gene products, which drive the gene expression patterns that produce cellular phenotypes. GRNs display a number of characteristics that are beneficial for the development and evolution of organisms. For example, they are often robust to genetic perturbation, such as mutations in regulatory regions or loss of gene function. Simultaneously, GRNs are often evolvable as these genetic perturbations are occasionally exploited to innovate novel regulatory programs. Several topological properties, such as degree distribution, are known to influence the robustness and evolvability of GRNs. Assortativity, which measures the propensity of nodes of similar connectivity to connect to one another, is a separate topological property that has recently been shown to influence the robustness of GRNs to point mutations in cis-regulatory regions. However, it remains to be seen how assortativity may influence the robustness and evolvability of GRNs to other forms of genetic perturbation, such as gene birth via duplication or de novo origination. Here, we employ a computational model of genetic regulation to investigate whether the assortativity of a GRN influences its robustness and evolvability upon gene birth. We find that the robustness of a GRN generally increases with increasing assortativity, while its evolvability generally decreases. However, the rate of change in robustness outpaces that of evolvability, resulting in an increased proportion of assortative GRNs that are simultaneously robust and evolvable. By providing a mechanistic explanation for these observations, this work extends our understanding of how the assortativity of a GRN influences its robustness and evolvability upon gene birth.
Boolean networks; out-components; genetic regulation
Complex diseases such as cancer and heart disease result from interactions between an individual's genetics and environment, i.e. their human ecology. Rates of complex diseases have consistently demonstrated geographic patterns of incidence, or spatial “clusters” of increased incidence relative to the general population. Likewise, genetic subpopulations and environmental influences are not evenly distributed across space. Merging appropriate methods from genetic epidemiology, ecology and geography will provide a more complete understanding of the spatial interactions between genetics and environment that result in spatial patterning of disease rates. Geographic Information Systems (GIS), which are tools designed specifically for dealing with geographic data and performing spatial analyses to determine their relationship, are key to this kind of data integration. Here the authors introduce a new interdisciplinary paradigm, ecogeographic genetic epidemiology, which uses GIS and spatial statistical analyses to layer genetic subpopulation and environmental data with disease rates and thereby discern the complex gene-environment interactions which result in spatial patterns of incidence.
Geographic Information Systems; Environmental Health; Population Genetics; Spatial Genetics; Medical Geography; Landscape Genetics
Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method such as naïve Bayes or logistic regression. Traditionally, MDR constructed variables have been evaluated with a naïve Bayes classifier that is combined with 10-fold cross validation to obtain an estimate of predictive accuracy or generalizability of epistasis models. We have likewise used permutation testing to statistically evaluate the significance of models obtained through MDR. The advantage of permutation testing is that it controls for false positives due to multiple testing. The disadvantage is that permutation testing is computationally expensive. This is an important issue that arises in the context of detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR using a 1000-fold permutation test with hypothesis testing using an extreme value distribution (EVD). We find that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1000-fold permutation test and is 50 times faster. We then demonstrate this new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1000-fold permutation test.
Extreme Value Distribution; Permutation Testing; Power; Type I Error; Bladder Cancer; Data Mining
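The general idea of replacing a large permutation test with an EVD fit can be sketched as follows. The abstract does not specify which extreme value distribution or fitting procedure was used, so this example assumes a Gumbel distribution fitted by the method of moments to a small set of permutation statistics; the null accuracies below are made-up numbers and the function names are illustrative.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(null_stats):
    """Method-of-moments Gumbel fit: returns (location mu, scale beta)."""
    mean = statistics.fmean(null_stats)
    sd = statistics.stdev(null_stats)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(observed, mu, beta):
    """Upper-tail probability P(X >= observed) under the fitted Gumbel."""
    return 1.0 - math.exp(-math.exp(-(observed - mu) / beta))


# Best-model testing accuracies from a handful of permuted datasets
# (illustrative values, far fewer than a 1000-fold permutation test).
null_accuracies = [0.50, 0.52, 0.51, 0.53, 0.49, 0.54, 0.51, 0.50, 0.52, 0.53]
mu, beta = fit_gumbel(null_accuracies)
print(gumbel_p_value(0.60, mu, beta))
```

Because the tail probability comes from the fitted curve rather than from counting permutations, a small p-value can be estimated without running enough permutations to observe such extreme values directly.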
In omic research, such as genome-wide association studies, researchers seek to repeat their results in other datasets to reduce false positive findings and thus provide evidence for the existence of true associations. Unfortunately, this standard validation approach cannot completely eliminate false positive conclusions, and it can also mask many true associations that might otherwise advance our understanding of pathology. These issues raise the question: How can we increase the amount of knowledge gained from high-throughput genetic data? To address this challenge, we present an approach that complements standard statistical validation methods by drawing attention to both potential false negative and false positive conclusions, as well as providing broad information for directing future research. The Diverse Convergent Evidence approach (DiCE) we propose integrates information from multiple sources (omics, informatics, and laboratory experiments) to estimate the strength of the available corroborating evidence supporting a given association. This process is designed to yield an evidence metric that has utility when etiologic heterogeneity, variable risk factor frequencies, and a variety of observational data imperfections might lead to false conclusions. We provide proof-of-principle examples in which DiCE identified strong evidence for associations that have established biological importance, when standard validation methods alone did not provide support. If used as an adjunct to standard validation methods, this approach can leverage multiple distinct data types to improve genetic risk factor discovery/validation, promote effective science communication, and guide future research directions.
Replication; Validation; Complex disease; Heterogeneity; GWAS; Omics; Type 2 error; Type 1 error; False negatives; False positives