1.  Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure 
BioData Mining  2015;8:5.
Biological insights into group differences, such as disease status, have been achieved through differential co-expression analysis of microarray data. Additional understanding of group differences may be achieved by integrating the connectivity structure of the differential co-expression network and per-gene differential expression between phenotypic groups. Such a global differential co-expression network strategy may increase sensitivity to detect gene-gene interactions (or expression epistasis) that may act as candidates for rewiring susceptibility co-expression networks.
We test two methods for inferring Genetic Association Interaction Networks (GAIN) incorporating both differential co-expression effects and differential expression effects: a generalized linear model (GLM) regression method with interaction effects (reGAIN) and a Fisher test method for correlation differences (dcGAIN). We rank the importance of each gene with complete interaction network centrality (CINC), which integrates each gene’s differential co-expression effects in the GAIN model along with each gene’s individual differential expression measure. We compare these methods with statistical learning methods Relief-F, Random Forests and Lasso. We also develop a mixture model and permutation approach for determining significant importance score thresholds for network centralities, Relief-F and Random Forest. We introduce a novel simulation strategy that generates microarray case–control data with embedded differential co-expression networks and underlying correlation structure based on scale-free or Erdős-Rényi (ER) random networks.
Using the network simulation strategy, we find that Relief-F and reGAIN provide the best balance between detecting interactions and main effects, plus reGAIN has the ability to adjust for covariates and model quantitative traits. The dcGAIN approach performs best at finding differential co-expression effects by design but worst for main effects, and it does not adjust for covariates and is limited to dichotomous outcomes. When the underlying network is scale free instead of ER, all interaction network methods have greater power to find differential co-expression effects. We apply these methods to a public microarray study of the differential immune response to influenza vaccine, and we identify effects that suggest a role in influenza vaccine immune response for genes from the PI3K family, which includes genes with known immunodeficiency function, and KLRG1, which is a known marker of senescence.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0040-x) contains supplementary material, which is available to authorized users.
PMCID: PMC4326454
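The abstract above describes dcGAIN as a Fisher test for correlation differences between phenotypic groups but does not give the statistic. A minimal sketch, assuming the standard Fisher z-transformation test for comparing two independent Pearson correlations (the function names `pearson` and `fisher_z_difference` are hypothetical, not taken from the paper or its software):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_difference(r_case, n_case, r_ctrl, n_ctrl):
    """Test whether a gene pair's correlation differs between two groups.

    Assumption: the textbook Fisher z-test, not necessarily the exact
    dcGAIN statistic. Returns (z, two-sided p-value).
    """
    z_case = math.atanh(r_case)   # Fisher transformation of each correlation
    z_ctrl = math.atanh(r_ctrl)
    se = math.sqrt(1.0 / (n_case - 3) + 1.0 / (n_ctrl - 3))
    z = (z_case - z_ctrl) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p

# Illustrative inputs only: a pair correlated in cases but not in controls.
z, p = fisher_z_difference(r_case=0.8, n_case=53, r_ctrl=0.0, n_ctrl=53)
```

In a network setting, a statistic like `z` could fill the off-diagonal entry for a gene pair, with a per-gene differential expression measure on the diagonal, as the abstract outlines.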
2.  Encore: Genetic Association Interaction Network Centrality Pipeline and Application to SLE Exome Data 
Genetic epidemiology  2013;37(6):614-621.
Open source tools are needed to facilitate the construction, analysis, and visualization of gene-gene interaction networks for sequencing data. To address this need, we present Encore, an open source network analysis pipeline for GWAS and rare variant data. Encore constructs Genetic Association Interaction Networks or Epistasis Networks using two optional approaches: our previous information-theory method or a generalized linear model approach. Additionally, Encore includes multiple data filtering options, including Random Forest/Random Jungle for main effect enrichment and Evaporative Cooling and Relief-F filters for enrichment of interaction effects. Encore implements SNPrank network centrality for identifying susceptibility hubs (nodes containing a large amount of disease susceptibility information through the combination of multivariate main effects and multiple gene-gene interactions in the network), and it provides appropriate files for interactive visualization of a network using tools from our online Galaxy instance. We implemented these algorithms in C++ using OpenMP for shared-memory parallel analysis on a server or desktop. To demonstrate Encore’s utility in analysis of genetic sequencing data, we present an analysis of exome resequencing data from healthy individuals and those with Systemic Lupus Erythematosus (SLE). Our results verify the importance of the previously associated SLE genes HLA-DRB and NCF2, and these two genes had the highest gene-gene interaction degrees among the susceptibility hubs. An additional 14 genes previously associated with SLE emerged in our epistasis network model of the exome data, and three novel candidate genes, ST8SIA4, CMTM4, and C2CD4B, were implicated in the model. In summary, we present a comprehensive tool for epistasis network analysis and the first such analysis of exome data from a genetic study of SLE.
Software Availability:
PMCID: PMC3955726  PMID: 23740754
epistasis network; machine learning; network analysis; network centrality; Systemic Lupus Erythematosus
3.  Sensible Initialization Using Expert Knowledge for Genome-Wide Analysis of Epistasis Using Genetic Programming 
In human genetics it is now possible to measure large numbers of DNA sequence variations across the human genome. Given current knowledge about biological networks and disease processes it seems likely that disease risk can best be modeled by interactions between biological components, which may be examined as interacting DNA sequence variations. The machine learning challenge is to effectively explore interactions in these datasets to identify combinations of variations which are predictive of common human diseases. Genetic programming is a promising approach to this problem. The goal of this study is to examine the role that an expert knowledge aware initializer can play in the framework of genetic programming. We show that this expert knowledge aware initializer outperforms both a random initializer and an enumerative initializer.
PMCID: PMC3012376  PMID: 21197156
4.  ReliefSeq: A Gene-Wise Adaptive-K Nearest-Neighbor Feature Selection Tool for Finding Gene-Gene Interactions and Main Effects in mRNA-Seq Gene Expression Data 
PLoS ONE  2013;8(12):e81527.
Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in the area of RNA-seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data that have a range of distributional properties reflected in real mRNA-seq data including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods based on computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples. 
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to detect both main effects and interaction effects. Software Availability:
PMCID: PMC3858248  PMID: 24339943
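The ReliefSeq abstract above builds on nearest-neighbor Relief-F importance scores. A minimal sketch of a fixed-k Relief-style score for a two-class problem follows, to illustrate the hit/miss idea only; it is an assumption-laden simplification, and the gene-wise adaptive choice of k that ReliefSeq performs is not reproduced because the abstract does not specify it. The name `relief_scores` is hypothetical.

```python
def relief_scores(X, y, k=1):
    """Relief-style feature weights for numeric features and class labels.

    For each sample, a feature gains weight when it differs on the k nearest
    neighbors of another class (misses) and loses weight when it differs on
    the k nearest neighbors of the same class (hits).
    Assumes each class has more than k samples.
    """
    n, p = len(X), len(X[0])
    # Per-feature range, used to scale differences into [0, 1].
    ranges = []
    for j in range(p):
        col = [row[j] for row in X]
        ranges.append((max(col) - min(col)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(p))

    weights = [0.0] * p
    for i in range(n):
        # All other samples, nearest first (Manhattan distance on scaled features).
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: dist(X[i], X[m]))
        hits = [m for m in others if y[m] == y[i]][:k]
        misses = [m for m in others if y[m] != y[i]][:k]
        for j in range(p):
            weights[j] -= sum(diff(j, X[i], X[h]) for h in hits) / (n * k)
            weights[j] += sum(diff(j, X[i], X[m]) for m in misses) / (n * k)
    return weights

# Toy data (illustrative): feature 0 tracks the class, feature 1 does not.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
scores = relief_scores(X, y, k=1)
```

Because neighbors are found using all features jointly, scores of this kind can respond to interactions as well as main effects, which is the property the abstract relies on.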
5.  Application of Genetic Algorithms to the Discovery of Complex Models for Simulation Studies in Human Genetics 
Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes. Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. In this paper, we present a strategy for identifying complex genetic models for simulation studies that utilizes genetic algorithms. The genetic models used in this study are penetrance functions that define the probability of disease given a specific DNA sequence variation has been inherited. We demonstrate that the genetic algorithm approach routinely identifies interesting and useful penetrance functions in a human-competitive manner.
PMCID: PMC3569849  PMID: 23413413
6.  Mask Functions for the Symbolic Modeling of Epistasis Using Genetic Programming 
The study of common, complex multifactorial diseases in genetic epidemiology is complicated by nonlinearity in the genotype-to-phenotype mapping relationship that is due, in part, to epistasis or gene-gene interactions. Symbolic discriminant analysis (SDA) is a flexible modeling approach which uses genetic programming (GP) to evolve an optimal predictive model using a predefined collection of mathematical functions, constants, and attributes. This has been shown to be an effective strategy for modeling epistasis. In the present study, we introduce the genetic “mask” as a novel building block which exploits expert knowledge in the form of a pre-constructed relationship between two attributes. The goal of this study was to determine whether the availability of “mask” building blocks improves SDA performance. The results of this study support the idea that pre-processing data improves GP performance.
PMCID: PMC3457012  PMID: 23019565
Genetic Analysis; Genetic Epidemiology; Genetic Programming; Symbolic Discriminant Analysis; Symbolic Regression; Function Set; Two-Locus Model; Genetic Mask
7.  Routine Discovery of Complex Genetic Models using Genetic Algorithms 
Applied soft computing  2004;4(1):79-86.
Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes (i.e. epistasis or gene-gene interaction). Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. We have previously developed a genetic algorithm approach to discovering complex genetic models in which two single nucleotide polymorphisms (SNPs) influence disease risk solely through nonlinear interactions. In this paper, we extend this approach for the discovery of high-order epistasis models involving three to five SNPs. We demonstrate that the genetic algorithm is capable of routinely discovering interesting high-order epistasis models in which each SNP influences risk of disease only through interactions with the other SNPs in the model. This study opens the door for routine simulation of complex gene-gene interactions among SNPs for the development and evaluation of new statistical and computational approaches for identifying common, complex multifactorial disease susceptibility genes.
PMCID: PMC2952957  PMID: 20948983
Gene-Gene Interactions; Simulation; Penetrance; Genetic Epidemiology
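The abstract above refers to models in which two SNPs influence disease risk solely through their interaction. A minimal sketch of what that means for a penetrance table: each single-locus marginal penetrance is the same for every genotype, so neither SNP shows a main effect. The table values and the 0.5 allele frequencies below are illustrative assumptions, not models reported in the paper, and the function names are hypothetical.

```python
def hardy_weinberg(p):
    """Genotype frequencies (AA, Aa, aa) for allele frequency p."""
    q = 1.0 - p
    return [p * p, 2.0 * p * q, q * q]

def marginal_penetrances(table, freqs_a, freqs_b):
    """Average a 3x3 penetrance table over the other locus.

    table[i][j] = P(disease | genotype i at SNP A, genotype j at SNP B).
    Returns (marginals for SNP A genotypes, marginals for SNP B genotypes).
    """
    by_a = [sum(table[i][j] * freqs_b[j] for j in range(3)) for i in range(3)]
    by_b = [sum(table[i][j] * freqs_a[i] for i in range(3)) for j in range(3)]
    return by_a, by_b

# Illustrative XOR-like table: risk is elevated only when exactly one of
# the two loci is heterozygous.
table = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
freqs = hardy_weinberg(0.5)
by_a, by_b = marginal_penetrances(table, freqs, freqs)
```

With these inputs every marginal equals 0.05, so a single-SNP test would see no effect even though the joint genotype determines risk. A search procedure such as the genetic algorithm described above could use the spread of these marginals as one term of its fitness function; that choice is an assumption here, since the abstract does not state the fitness function.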
8.  A Computationally Efficient Hypothesis Testing Method for Epistasis Analysis using Multifactor Dimensionality Reduction 
Genetic epidemiology  2009;33(1):87-94.
Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method such as naïve Bayes or logistic regression. Traditionally, MDR-constructed variables have been evaluated with a naïve Bayes classifier that is combined with 10-fold cross validation to obtain an estimate of predictive accuracy or generalizability of epistasis models. We have typically used permutation testing to statistically evaluate the significance of models obtained through MDR. The advantage of permutation testing is that it controls for false-positives due to multiple testing. The disadvantage is that permutation testing is computationally expensive. This is an important issue that arises in the context of detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR using a 1000-fold permutation test with hypothesis testing using an extreme value distribution (EVD). We find that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1000-fold permutation test and is 50 times faster. We then demonstrate this new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1000-fold permutation test.
PMCID: PMC2700860  PMID: 18671250
Extreme Value Distribution; Permutation Testing; Power; Type I Error; Bladder Cancer; Data Mining
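The abstract above replaces a 1000-fold permutation test with hypothesis testing based on an extreme value distribution. A minimal sketch of the general idea, assuming a Gumbel distribution fitted by the method of moments to the best-model accuracies from a small number of permutations; the abstract does not say which EVD family or fitting procedure was used, so both are assumptions, and the numbers and names below are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel(null_best_accuracies):
    """Method-of-moments Gumbel fit to null 'best model' accuracies.

    Each value is assumed to be the best testing accuracy found in one
    permuted (null) dataset. Returns (location, scale).
    """
    mean = statistics.mean(null_best_accuracies)
    sd = statistics.stdev(null_best_accuracies)
    scale = sd * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale

def gumbel_p_value(observed, loc, scale):
    """Upper-tail probability of the fitted Gumbel at the observed accuracy."""
    return 1.0 - math.exp(-math.exp(-(observed - loc) / scale))

# Hypothetical null accuracies from a handful of permutations.
null_best = [0.50, 0.52, 0.54, 0.56, 0.58]
loc, scale = fit_gumbel(null_best)
p_strong = gumbel_p_value(0.75, loc, scale)  # far above the null range
p_weak = gumbel_p_value(0.50, loc, scale)    # inside the null range
```

The saving comes from needing only enough permutations to estimate two parameters, after which small p-values are read from the fitted tail rather than counted empirically.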
9.  Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases 
BMC Bioinformatics  2003;4:28.
Appropriate definition of neural network architecture prior to data analysis is crucial for successful data mining. This can be challenging when the underlying model of the data is unknown. The goal of this study was to determine whether optimizing neural network architecture using genetic programming as a machine learning strategy would improve the ability of neural networks to model and detect nonlinear interactions among genes in studies of common human diseases.
Using simulated data, we show that a genetic programming optimized neural network approach is able to model gene-gene interactions as well as a traditional back propagation neural network. Furthermore, the genetic programming optimized neural network is better than the traditional back propagation neural network approach in terms of predictive ability and power to detect gene-gene interactions when non-functional polymorphisms are present.
This study suggests that a machine learning strategy for optimizing neural network architecture may be preferable to traditional trial-and-error approaches for the identification and characterization of gene-gene interactions in common, complex human diseases.
PMCID: PMC183838  PMID: 12846935
