Cells dynamically adapt their gene expression patterns in response to various stimuli. This response is orchestrated into a number of gene expression modules consisting of co-regulated genes. A growing pool of publicly available microarray datasets allows the identification of modules by monitoring expression changes over time. These time-series datasets can be searched for gene expression modules by one of the many clustering methods published to date. For an integrative analysis, several time-series datasets can be joined into a three-dimensional gene-condition-time dataset, to which standard clustering or biclustering methods are, however, not applicable. We thus devise a probabilistic clustering algorithm for gene-condition-time datasets.
In this work, we present the EDISA (Extended Dimension Iterative Signature Algorithm), a novel probabilistic clustering approach for 3D gene-condition-time datasets. Based on mathematical definitions of gene expression modules, the EDISA samples initial modules from the dataset which are then refined by removing genes and conditions until they comply with the module definition. A subsequent extension step ensures gene and condition maximality. We applied the algorithm to a synthetic dataset and were able to successfully recover the implanted modules over a range of background noise intensities. Analysis of microarray datasets has led us to define three biologically relevant module types: 1) We found modules with independent response profiles to be the most prevalent ones. These modules comprise genes which are co-regulated under several conditions, yet with a different response pattern under each condition. 2) Coherent modules with similar responses under all conditions occurred frequently as well and were often contained within modules of the first type. 3) A third module type, which covers a response specific to a single condition, was also detected, but rarely. All of these modules are essentially different types of biclusters.
We successfully applied the EDISA to different 3D datasets. While previous studies were mostly aimed at detecting coherent modules only, our results show that coherent responses are often part of a more general module type with independent response profiles under different conditions. Our approach thus allows for a more comprehensive view of the gene expression response. After subsequent analysis of the resulting modules, the EDISA helped to shed light on the global organization of transcriptional control. An implementation of the algorithm is available at http://www-ra.informatik.uni-tuebingen.de/software/IAGEN/.
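The refine step described above can be illustrated with a minimal, signature-algorithm-style sketch (an assumption-laden simplification, not the EDISA implementation): starting from a sampled gene set, genes whose profiles correlate poorly with the module mean are iteratively removed until the set stabilizes.

```python
def refine_module(data, genes, threshold=0.8, max_iter=50):
    """Illustrative ISA-style refinement (simplified, 2D instead of 3D):
    iteratively keep only genes whose expression profile correlates with
    the current module mean above `threshold`."""
    def corr(a, b):
        # Pearson correlation, computed from scratch to stay self-contained.
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a) ** 0.5
        vb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (va * vb) if va and vb else 0.0

    module = set(genes)
    for _ in range(max_iter):
        profiles = [data[g] for g in module]
        mean = [sum(col) / len(col) for col in zip(*profiles)]
        kept = {g for g in module if corr(data[g], mean) >= threshold}
        if kept == module or not kept:
            break  # converged (or collapsed): stop refining
        module = kept
    return module
```

In this toy form, two co-regulated genes survive refinement while an uncorrelated gene is discarded; the real EDISA additionally iterates over conditions and applies separate gene and condition thresholds.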
High throughput signature sequencing holds many promises, one of which is the ready identification of in vivo transcription factor binding sites, histone modifications, changes in chromatin structure and patterns of DNA methylation across entire genomes. In these experiments, chromatin immunoprecipitation is used to enrich for particular DNA sequences of interest and signature sequencing is used to map the regions to the genome (ChIP-Seq). Elucidation of these sites of DNA-protein binding/modification is proving instrumental in reconstructing networks of gene regulation and chromatin remodelling that direct development, response to cellular perturbation, and neoplastic transformation.
Here we present a package of algorithms and software that makes use of control input data to reduce false positives and estimate confidence in ChIP-Seq peaks. Several different methods were compared using two simulated spike-in datasets. Use of control input data and a normalized difference score were found to more than double the recovery of ChIP-Seq peaks at a 5% false discovery rate (FDR). Moreover, both a binomial p-value/q-value and an empirical FDR were found to predict the true FDR within 2–3 fold and are more reliable estimators of confidence than a global Poisson p-value. These methods were then used to reanalyze Johnson et al.'s neuron-restrictive silencer factor (NRSF) ChIP-Seq data without relying on extensive qPCR-validated NRSF sites or the presence of NRSF binding motifs for setting thresholds.
The methods developed and tested here show considerable promise for reducing false positives and estimating confidence in ChIP-Seq data without any prior knowledge of the ChIP target. They are part of a larger open source package freely available from http://useq.sourceforge.net/.
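The core idea of using control input data can be sketched in a few lines (a hedged illustration; the function names and the exact scoring formula are assumptions, not the package's API). An empirical FDR at a score threshold is simply the number of peaks the pipeline calls in the control input divided by the number it calls in the ChIP sample, and a normalized difference score subtracts library-size-scaled control counts from ChIP counts:

```python
def empirical_fdr(chip_scores, control_scores, threshold):
    """Empirical FDR at `threshold`: peaks called in the control input
    (presumed false) divided by peaks called in the ChIP sample."""
    chip = sum(1 for s in chip_scores if s >= threshold)
    ctrl = sum(1 for s in control_scores if s >= threshold)
    return ctrl / chip if chip else 0.0

def normalized_difference(chip_count, control_count, scale):
    """Hypothetical normalized difference score: control read counts are
    scaled to the ChIP library size before subtraction."""
    return chip_count - scale * control_count
```

Sweeping the threshold and reporting the score at which the empirical FDR crosses 5% is one simple way such estimates translate into peak-calling cutoffs.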
Comprehensive understanding of biological systems requires efficient and systematic assimilation of high-throughput datasets in the context of the existing knowledge base. A major limitation in the field of proteomics is the lack of an appropriate software platform that can synthesize a large number of experimental datasets in the context of the existing knowledge base. Here, we describe a software platform, termed PROTEOME-3D, that utilizes three essential features for systematic analysis of proteomics data: creation of a scalable, queryable, customized database for identified proteins from published literature; graphical tools for displaying proteome landscapes and trends from multiple large-scale experiments; and interactive data analysis that facilitates identification of crucial networks and pathways. Thus, PROTEOME-3D offers a standardized platform to analyze high-throughput experimental datasets for the identification of crucial players in co-regulated pathways and cellular processes.
Motivation: High-throughput measurement techniques for metabolism and gene expression provide a wealth of information for the identification of metabolic network models. Yet, missing observations scattered over the dataset restrict the number of effectively available datapoints and make classical regression techniques inaccurate or inapplicable. Thorough exploitation of the data by identification techniques that explicitly cope with missing observations is therefore of major importance.
Results: We develop a maximum-likelihood approach for the estimation of unknown parameters of metabolic network models that relies on the integration of statistical priors to compensate for the missing data. In the context of the linlog metabolic modeling framework, we implement the identification method by an Expectation-Maximization (EM) algorithm and by a simpler direct numerical optimization method. We evaluate performance of our methods by comparison to existing approaches, and show that our EM method provides the best results over a variety of simulated scenarios. We then apply the EM algorithm to a real problem, the identification of a model for the Escherichia coli central carbon metabolism, based on challenging experimental data from the literature. This leads to promising results and allows us to highlight critical identification issues.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates and proteomic analyses generate huge datasets which need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is a commonly used algorithm for identifying peptides and proteins from two dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high throughput 'post data acquisition analysis' are described in the literature. However, these pipelines need to be updated to accommodate the rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probability to peptide identifications. The user has the option to select "quality thresholds" on peptide identifications based on the P value. We also estimate probability for protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant utilizes a parametric method (ANOVA) for calculating differences in protein expression based on the quantitative measure ΣXcorr. Alternatively, ProtQuant output can be further processed using non-parametric Monte-Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
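The Benjamini-Hochberg correction applied to the ANOVA and resampling P values is a standard step-up procedure; a minimal sketch (not the workflow's code) that returns BH-adjusted p-values is:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up procedure: sort p-values, multiply each
    by m/rank, then enforce monotonicity from the largest rank down.
    Returns adjusted p-values (q-values) in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        prev = min(prev, pvalues[i] * m / rank)  # running minimum keeps q-values monotone
        adjusted[i] = prev
    return adjusted
```

A protein is then reported as differentially expressed when its adjusted value falls below the chosen FDR level.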
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy-to-use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow provides the intermediate step required to link proteomics data to biological knowledge for generating testable hypotheses.
Universal principles underlying network science, and their ever-increasing applications in biomedicine, underscore the unprecedented capacity of systems biology based strategies to synthesize and resolve massive high throughput generated datasets. Enabling previously unattainable comprehension of biological complexity, systems approaches have accelerated progress in elucidating disease prediction, progression, and outcome. Applied to the spectrum of states spanning health and disease, network proteomics establishes a collation, integration, and prioritization algorithm to guide mapping and decoding of proteome landscapes from large-scale raw data. Providing unparalleled deconvolution of protein lists into global interactomes, integrative systems proteomics enables objective, multi-modal interpretation at molecular, pathway, and network scales, merging individual molecular components, their plurality of interactions, and functional contributions for systems comprehension. As such, network systems approaches are increasingly exploited for objective interpretation of cardiovascular proteomics studies. Here, we highlight network systems proteomic analysis pipelines for integration and biological interpretation through protein cartography, ontological categorization, pathway and functional enrichment, and complex network analysis.
ATP-sensitive K+ channel; bioinformatics; complex network analysis; KATP channel; Kir6.2; genetics; heart disease; metabolism; network biology; proteome; regenerative medicine; SUR2A; stem cells; systems biology
Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.
The cell uses complex regulatory networks to modulate the expression of genes in response to changes in cellular and environmental conditions. The transcript level of a gene is directly affected by the binding of transcriptional regulators to DNA motifs in its promoter sequence. Therefore, both expression levels of transcription factors and other regulatory proteins as well as sequence information in the promoters contribute to transcriptional gene regulation. In this study, we describe a new computational strategy for learning gene regulatory programs from gene expression data based on the MEDUSA algorithm. We learn a model that predicts differential expression of target genes from the expression levels of regulators, the presence of DNA motifs in promoter sequences, and binding data for transcription factors. Unlike many previous approaches, we do not assume that genes are regulated in clusters, and we learn DNA motifs de novo from promoter sequences as an integrated part of our algorithm. We use MEDUSA to produce a global map of the yeast oxygen and heme regulatory network. To demonstrate that MEDUSA can reveal detailed information about regulatory mechanisms, we perform biochemical experiments to confirm the predicted regulators for an important hypoxia gene.
Motivation: Systematic and scalable parameter estimation is a key to construct complex gene regulatory models and to ultimately facilitate an integrative systems biology approach to quantitatively understand the molecular mechanisms underpinning gene regulation.
Results: Here, we report a novel framework for efficient and scalable parameter estimation that focuses specifically on modeling of gene circuits. Exploiting the structure commonly found in gene circuit models, this framework decomposes a system of coupled rate equations into individual ones and efficiently integrates them separately to reconstruct the mean time evolution of the gene products. The accuracy of the parameter estimates is refined by iteratively increasing the accuracy of numerical integration using the model structure. As a case study, we applied our framework to four gene circuit models with complex dynamics based on three synthetic datasets and one time series microarray data set. We compared our framework to three state-of-the-art parameter estimation methods and found that our approach consistently generated higher quality parameter solutions efficiently. Although many general-purpose parameter estimation methods have been applied for modeling of gene circuits, our results suggest that the use of more tailored approaches to use domain-specific information may be a key to reverse engineering of complex biological systems.
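The decomposition idea above can be illustrated with a deliberately simplified sketch (assumed model form and function names, not the reported framework): once the system is decoupled, each gene's rate equation can be integrated on its own using the measured profile of its regulator as a fixed input, and its parameters fitted independently, here by a naive grid search.

```python
def simulate_gene(k_syn, k_deg, regulator, x0, dt):
    """Forward-Euler integration of dx/dt = k_syn*u(t) - k_deg*x for one
    gene, with the regulator profile u(t) taken directly from data, which
    decouples this equation from the rest of the system."""
    x, traj = x0, [x0]
    for u in regulator[:-1]:
        x = x + dt * (k_syn * u - k_deg * x)
        traj.append(x)
    return traj

def fit_gene(observed, regulator, dt, grid):
    """Grid-search sketch of the decomposed estimation step: each gene is
    fitted against its own measurements, independently of other genes."""
    best, best_err = None, float("inf")
    for k_syn in grid:
        for k_deg in grid:
            sim = simulate_gene(k_syn, k_deg, regulator, observed[0], dt)
            err = sum((a - b) ** 2 for a, b in zip(sim, observed))
            if err < best_err:
                best, best_err = (k_syn, k_deg), err
    return best
```

The framework in the paper replaces the grid search with proper optimization and iteratively tightens the integration accuracy, but the per-gene decoupling shown here is what makes the problem scale.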
Supplementary data are available at Bioinformatics online.
High throughput gene expression experiments yield large amounts of data that can augment our understanding of disease processes, in addition to classifying samples. Here we present new paradigms of data separation based on construction of transcriptional regulatory networks for normal and abnormal cells using sequence predictions, literature based data and gene expression studies. We analyzed expression datasets from a number of diseased and normal cells, including different types of acute leukemia, and breast cancer with variable clinical outcome.
We constructed sample-specific regulatory networks to identify links between transcription factors (TFs) and regulated genes that differentiate between healthy and diseased states. This approach carries the advantage of identifying key transcription factor-gene pairs with differential activity between healthy and diseased states rather than merely using gene expression profiles, thus alluding to processes that may be involved in gene deregulation. We then generalized this approach by studying simultaneous changes in functionality of multiple regulatory links pointing to a regulated gene or emanating from one TF (or changes in gene centrality defined by its in-degree or out-degree measures, respectively). We found that samples can often be separated based on these measures of gene centrality more robustly than using individual links.
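The in-degree and out-degree centralities used above are straightforward to compute from a sample-specific list of active regulatory links; a minimal sketch (illustrative naming, not the study's code) is:

```python
def degree_centralities(edges):
    """Given a list of active (tf, target) links in one sample's network,
    return in-degree (number of TFs regulating each gene) and out-degree
    (number of targets each TF regulates)."""
    indeg, outdeg = {}, {}
    for tf, target in edges:
        outdeg[tf] = outdeg.get(tf, 0) + 1
        indeg[target] = indeg.get(target, 0) + 1
    return indeg, outdeg
```

Comparing these per-sample degree vectors between healthy and diseased networks is then the basis for the centrality-driven separation described above.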
We examined distributions of distances (the number of links needed to traverse the path between each pair of genes) in the transcriptional networks for gene subsets whose collective expression profiles could best separate each dataset into predefined groups. We found that genes that optimally classify samples are concentrated in neighborhoods in the gene regulatory networks. This suggests that genes that are deregulated in diseased states exhibit a remarkable degree of connectivity.
Transcription factor-regulated gene links and centrality of genes on transcriptional networks can be used to differentiate between cell types. Transcriptional network blueprints can be used as a basis for further research into gene deregulation in diseased states.
Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster—to infer a network adjacency matrix. Finally, we present results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence.
Predicting protein complexes from protein-protein interaction data is becoming a fundamental problem in computational biology. The identification and characterization of the protein complexes involved are crucial to understanding the molecular events under normal and abnormal physiological conditions. Meanwhile, large datasets of experimentally detected protein-protein interactions have been generated using high-throughput experimental techniques. However, experimental data are usually liable to contain a large number of spurious interactions. Therefore, it is essential to validate these interactions before exploiting them to predict protein complexes.
In this paper, we propose a novel graph mining algorithm (PEWCC) to identify such protein complexes. Firstly, the algorithm assesses the reliability of the interaction data, then predicts protein complexes based on the concept of weighted clustering coefficient. To demonstrate the effectiveness of the proposed method, the performance of PEWCC was compared to that of several state-of-the-art methods. PEWCC was able to detect more matched complexes than any of these methods, with higher quality scores.
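The clustering coefficient underlying PEWCC can be illustrated in its basic unweighted form (a simplification; PEWCC itself uses a weighted variant computed after filtering unreliable interactions):

```python
def clustering_coefficient(graph, node):
    """Unweighted clustering coefficient of `node`: the fraction of pairs
    of its neighbours that are themselves connected. `graph` maps each
    node to the set of its neighbours. Dense local neighbourhoods, where
    this value is high, are candidate protein complexes."""
    nbrs = list(graph.get(node, ()))
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph.get(nbrs[i], ()))
    return 2.0 * links / (k * (k - 1))
```

A node inside a triangle scores 1.0, while the hub of a star scores 0.0, which is why the measure separates complex-like cliques from spurious hub-and-spoke interaction patterns.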
The higher accuracy achieved by PEWCC in detecting protein complexes is a valid argument in favor of the proposed method. The datasets and programs are freely available at
The reverse engineering of transcription regulatory networks from expression data is attracting considerable interest in the bioinformatics community. An important family of inference techniques is represented by algorithms based on information theoretic measures which rely on the computation of pairwise mutual information. This paper aims to study the impact of the entropy estimator on the quality of the inferred networks. This is done by means of a comprehensive study which takes into consideration three state-of-the-art mutual information algorithms: ARACNE, CLR, and MRNET. Two different setups are considered in this work. The first uses a set of 12 synthetically generated datasets to compare 8 different entropy estimators and the three network inference algorithms. The two methods emerging as the most accurate ones from the first set of experiments are the MRNET method combined with the newly applied Spearman correlation and the CLR method combined with the Pearson correlation. The validation of these two techniques is then carried out on a set of 10 public domain microarray datasets measuring the transcriptional regulatory activity in yeast.
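The Spearman correlation highlighted above is simply the Pearson correlation of the ranks; a minimal sketch (ignoring tied ranks for brevity) of the pairwise measure that would feed an MRNET-style inference step is:

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks.
    Ties are given arbitrary consecutive ranks here (a simplification;
    proper implementations average tied ranks)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0
```

Because ranking discards the marginal distributions, this estimator captures any monotone dependency between two expression profiles, which is one plausible reason it paired well with MRNET in the benchmark.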
The widespread use of high-throughput experimental assays designed to measure the entire complement of a cell's genes or gene products has led to vast stores of data that are extremely plentiful in terms of the number of items they can measure in a single sample, yet often sparse in the number of samples per experiment due to their high cost. This often leads to datasets where the number of treatment levels or time points sampled is limited, or where there are very small numbers of technical and/or biological replicates. Here we introduce a novel algorithm to quantify the uncertainty in the unmeasured intervals between biological measurements taken across a set of quantitative treatments. The algorithm provides a probabilistic distribution of possible gene expression values within unmeasured intervals, based on a plausible biological constraint. We show how quantification of this uncertainty can be used to guide researchers in further data collection by identifying which samples would likely add the most information to the system under study. Although the context for developing the algorithm was gene expression measurements taken over a time series, the approach can be readily applied to any set of quantitative systems biology measurements taken following quantitative (i.e. non-categorical) treatments. In principle, the method could also be applied to combinations of treatments, in which case it could greatly simplify the task of exploring the large combinatorial space of future possible measurements.
Understanding the dynamic mechanism behind the transcriptional organization of genes in response to varying environmental conditions requires time-dependent data. The dynamic transcriptional response obtained by real-time RT-qPCR experiments can only be correctly interpreted if suitable reference genes are used in the analysis. The lack of available studies on candidate reference genes in dynamic gene expression studies necessitates the identification and verification of a suitable gene set for the analysis of transient gene expression response.
In this study, a candidate reference gene set for RT-qPCR analysis of dynamic transcriptional changes in Saccharomyces cerevisiae was determined using 31 different publicly available time series transcriptome datasets. Ten of the twelve candidates (TPI1, FBA1, CCW12, CDC19, ADH1, PGK1, GCN4, PDC1, RPS26A and ARF1) we identified were not previously reported as potential reference genes. Our method also identified the commonly used reference genes ACT1 and TDH3. The most stable reference genes from this pool were determined as TPI1, FBA1, CDC19 and ACT1 in response to a perturbation in the amount of available glucose and as FBA1, TDH3, CCW12 and ACT1 in response to a perturbation in the amount of available ammonium. The use of these newly proposed gene sets outperformed the use of common reference genes in the determination of dynamic transcriptional response of the target genes, HAP4 and MEP2, in response to relaxation from glucose and ammonium limitations, respectively.
A candidate reference gene set to be used in dynamic real-time RT-qPCR expression profiling in yeast was proposed for the first time in the present study. Suitable pools of stable reference genes to be used under different experimental conditions could be selected from this candidate set in order to successfully determine the expression profiles for the genes of interest.
In biology, it is meaningful and important to identify cytokines and investigate their various functions and biochemical mechanisms. However, several issues remain, including the large scale of benchmark datasets, serious imbalance of data, and discovery of new gene families. In this paper, we employ a machine learning approach based on a novel ensemble classifier to predict cytokines. We directly selected amino acid sequences as research objects. First, we carefully preprocessed the benchmark data. Next, we analyzed the physicochemical properties and distribution of whole amino acids and then extracted a group of 120-dimensional (120D) valid features to represent sequences. Third, in view of the serious imbalance in benchmark datasets, we utilized a sampling approach based on the synthetic minority oversampling technique algorithm and K-means clustering undersampling algorithm to rebuild the training set. Finally, we built a library for dynamic selection and circulating combination based on clustering (LibD3C) and employed the new training set to realize cytokine classification. Experiments showed that the geometric mean of sensitivity and specificity obtained through our approach is as high as 93.3%, which proves that our approach is effective for identifying cytokines.
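The oversampling half of the rebalancing step can be sketched as follows (a minimal SMOTE-style illustration with assumed names and interface, not the paper's implementation): each synthetic point is interpolated between a random minority sample and one of its k nearest minority neighbours.

```python
import random

def smote_sample(minority, k=3, rng=None):
    """Generate one SMOTE-style synthetic minority point: pick a random
    minority sample, find its k nearest minority neighbours by squared
    Euclidean distance, and interpolate toward a randomly chosen one."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    dists = sorted((sum((a - b) ** 2 for a, b in zip(base, other)), other)
                   for other in minority if other is not base)
    neighbour = rng.choice([p for _, p in dists[:k]])
    t = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + t * (b - a) for a, b in zip(base, neighbour))
```

Because each synthetic point lies on a segment between two real minority samples, the generated set stays inside the convex hull of the minority class; K-means undersampling of the majority class would then complete the rebalanced training set.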
Inferring the topology of a gene-regulatory network (GRN) from genome-scale time-series measurements of transcriptional change has proved useful for disentangling complex biological processes. To address the challenges associated with this inference, a number of competing approaches have previously been used, including examples from information theory, Bayesian and dynamic Bayesian networks (DBNs), and ordinary differential equation (ODE) or stochastic differential equation (SDE) models. The performance of these competing approaches has previously been assessed using a variety of in silico and in vivo datasets. Here, we revisit this work by assessing the performance of more recent network inference algorithms, including a novel non-parametric learning approach based upon nonlinear dynamical systems. For larger GRNs, containing hundreds of genes, these non-parametric approaches more accurately infer network structures than do traditional approaches, but at significant computational cost. For smaller systems, DBNs are competitive with the non-parametric approaches with respect to computational time and accuracy, and both of these approaches appear to be more accurate than Granger causality-based methods and those using simple ODE models.
gene-regulatory networks; inference; gene expression
Motivation: Advances in high-throughput sequencing have resulted in rapid growth in large, high-quality datasets including those arising from transcription factor (TF) ChIP-seq experiments. While there are many existing tools for discovering TF binding site motifs in such datasets, most web-based tools cannot directly process such large datasets.
Results: The MEME-ChIP web service is designed to analyze ChIP-seq ‘peak regions’—short genomic regions surrounding declared ChIP-seq ‘peaks’. Given a set of genomic regions, it performs (i) ab initio motif discovery, (ii) motif enrichment analysis, (iii) motif visualization, (iv) binding affinity analysis and (v) motif identification. It runs two complementary motif discovery algorithms on the input data—MEME and DREME—and uses the motifs they discover in subsequent visualization, binding affinity and identification steps. MEME-ChIP also performs motif enrichment analysis using the AME algorithm, which can detect very low levels of enrichment of binding sites for TFs with known DNA-binding motifs. Importantly, unlike with the MEME web service, there is no restriction on the size or number of uploaded sequences, allowing very large ChIP-seq datasets to be analyzed. The analyses performed by MEME-ChIP provide the user with a varied view of the binding and regulatory activity of the ChIP-ed TF, as well as the possible involvement of other DNA-binding TFs.
Availability: MEME-ChIP is available as part of the MEME Suite at http://meme.nbcr.net.
Supplementary information: Supplementary data are available at Bioinformatics online.
There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, convergence to local minima, or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points that are often generated by high throughput experiments.
To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. The clustering of 10^6 data points in 2D data space takes about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.
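The "cloud removal" picture above can be made concrete with a small sketch (an illustrative simplification, not the Misty Mountain implementation): at each density level, the connected components of histogram bins above that level are the candidate clusters, and lowering the level step by step reveals when peaks merge.

```python
def threshold_components(hist, level):
    """Connected components (4-neighbour) of 2D histogram bins whose
    count exceeds `level`. Each component corresponds to one candidate
    cluster at this density cross section."""
    rows, cols = len(hist), len(hist[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if hist[r][c] > level and (r, c) not in seen:
                # flood-fill one component with an explicit stack
                stack, comp = [(r, c)], []
                seen.add((r, c))
                while stack:
                    i, j = stack.pop()
                    comp.append((i, j))
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and hist[ni][nj] > level
                                and (ni, nj) not in seen):
                            seen.add((ni, nj))
                            stack.append((ni, nj))
                components.append(comp)
    return components
```

In this toy picture, a high level isolates only the tallest peak, while a lower level exposes a second, separate peak; deciding statistically which separations are significant is where the percolation-theoretic machinery of the actual algorithm comes in.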
Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
We propose a novel method called Partitioning based Adaptive Irrelevant Feature Eliminator (PAIFE) for dimensionality reduction in high-dimensional biomedical datasets. PAIFE evaluates feature-target relationships over not only a whole dataset, but also the partitioned subsets, and is extremely effective in identifying features whose relevance to the target is conditional on certain other features. PAIFE adaptively employs the most appropriate feature evaluation strategy, statistical test and parameter instantiation. We envision PAIFE to be used as a third-party data pre-processing tool for dimensionality reduction of high-dimensional clinical datasets. Experiments on synthetic datasets showed that PAIFE consistently outperformed state-of-the-art feature selection methods in removing irrelevant features while retaining relevant features. Experiments on genomic and proteomic datasets demonstrated that PAIFE was able to remove significant numbers of irrelevant features in real-world biomedical datasets. Classification models constructed from the retained features either matched or improved the classification performances of the models constructed using all features.
Inference of gene-regulatory networks (GRNs) is important for understanding the behaviour and potential treatment of biological systems. Knowledge about GRNs gained from transcriptome analysis can be increased by multiple experiments and/or multiple stimuli. Since GRNs are complex, dynamic systems, appropriate methods and algorithms are needed for constructing models that describe these dynamics. Algorithms based on heuristic approaches reduce the effort of parameter identification and the computation time.
The NetGenerator V2.0 algorithm, a heuristic for network inference, is proposed and described. It automatically generates a system of differential equations modelling the structure and dynamics of the network, based on time-resolved gene expression data. In contrast to the previous version, the inference considers multi-stimuli multi-experiment data and offers different methods for integrating prior knowledge. The resulting significant changes in the algorithmic procedures are explained in detail. NetGenerator is applied to relevant benchmark examples to evaluate the inference on data from experiments with different stimuli. In addition, the GRN underlying chondrogenic differentiation, a real-world multi-stimulus problem, is inferred and analysed.
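At its simplest, inferring a differential-equation model from time-resolved expression data amounts to estimating the coefficients of a linear ODE from finite-difference derivatives. The single-gene sketch below illustrates this general idea only; it is not the NetGenerator algorithm, which additionally infers network structure, handles multi-stimuli multi-experiment data and integrates prior knowledge. It fits dx/dt = a·x + b·u by least squares:

```python
def fit_linear_ode(ts, xs, us):
    """Estimate a, b in dx/dt = a*x + b*u from a single time course by
    least squares on finite-difference derivatives (2x2 normal equations)."""
    dxdt, X, U = [], [], []
    for k in range(len(ts) - 1):
        dxdt.append((xs[k + 1] - xs[k]) / (ts[k + 1] - ts[k]))
        X.append(xs[k])
        U.append(us[k])
    # normal equations: [Sxx Sxu; Sxu Suu] [a; b] = [Sxd; Sud]
    Sxx = sum(x * x for x in X)
    Suu = sum(u * u for u in U)
    Sxu = sum(x * u for x, u in zip(X, U))
    Sxd = sum(x * d for x, d in zip(X, dxdt))
    Sud = sum(u * d for u, d in zip(U, dxdt))
    det = Sxx * Suu - Sxu * Sxu
    a = (Sxd * Suu - Sud * Sxu) / det
    b = (Sxx * Sud - Sxu * Sxd) / det
    return a, b
```

Given a time course generated by such a model (here a degradation rate a = -0.5 driven by a constant stimulus u with gain b = 1), the fit recovers the parameters; NetGenerator solves the analogous problem for coupled gene systems.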
NetGenerator is able to determine the structure and parameters of GRNs and their dynamics. The new features of the algorithm extend the range of possible experimental set-ups, results and biological interpretations. Based upon benchmarks, the algorithm provides good results in terms of specificity, sensitivity, efficiency and model fit.
Gene-regulatory networks; Network inference; Heuristic algorithm; ODE; NetGenerator
Elucidation of new biomarkers and potential drug targets from high-throughput profiling data is a challenging task due to the limited number of available biological samples and the questionable reproducibility of differential changes in cross-dataset comparisons. In this paper we propose a novel computational approach to drug and biomarker discovery using comprehensive analysis of multiple expression profiling datasets.
The new method relies on aggregation of individual profiling experiments combined with a leave-one-dataset-out validation approach. Aggregated datasets were studied using the Sub-Network Enrichment Analysis (SNEA) algorithm to find consistent, statistically significant key regulators within the global literature-extracted expression regulation network. These regulators were then linked to the consistently differentially expressed genes.
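The aggregation-plus-validation idea can be sketched as follows; the majority-vote criterion below is an illustrative assumption, not the SNEA enrichment statistic used in the paper. A gene is kept only if it remains supported in every leave-one-dataset-out round:

```python
def lodo_consistent_genes(dataset_hits):
    """dataset_hits: list of sets of differentially expressed genes, one
    set per profiling dataset. A gene is kept only if it still reaches a
    majority vote in every leave-one-dataset-out round, i.e. its signal
    does not hinge on any single dataset."""
    n = len(dataset_hits)
    all_genes = set().union(*dataset_hits)
    consistent = set()
    for gene in all_genes:
        ok = True
        for left_out in range(n):
            votes = sum(1 for i, hits in enumerate(dataset_hits)
                        if i != left_out and gene in hits)
            if votes <= (n - 1) / 2:   # no majority without this dataset
                ok = False
                break
        if ok:
            consistent.add(gene)
    return consistent
```

A gene found in only half the datasets drops out as soon as one of its supporting datasets is left out, which is exactly the reproducibility concern the validation step addresses.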
We applied our approach to several publicly available human muscle gene expression profiling datasets related to Duchenne muscular dystrophy (DMD). To detect both enhanced and repressed processes, we considered up- and down-regulated genes separately. Applying the proposed approach to the search for regulators, we discovered disturbances in the activity of several muscle-related transcription factors (e.g. MYOG and MYOD1) and of regulators of inflammation, regeneration and fibrosis. Almost all SNEA-derived regulators of down-regulated genes (e.g. AMPK, TORC2, PPARGC1A) correspond to a single common pathway important for the fast-to-slow twitch fiber type transition. We hypothesize that this process can affect the severity of DMD symptoms, making the corresponding regulators and downstream genes valuable candidates for potential drug targets and exploratory biomarkers.
Comparison of gene expression in diseased and normal tissue is a powerful tool for studying processes involved in pathogenesis and for searching for potential drug targets and biomarkers of disease progression and treatment outcome. We have developed a novel approach for systematic, knowledge-driven analysis of gene expression profiling data, which can suggest the underlying cause of the observed differential expression by identifying the expression regulators that might be involved. These regulators are not only promising subjects for further investigation, but also potential drug targets, as normalization of their activity might alleviate some of the disease's symptoms. The targets downstream of the suggested regulators can be proposed as exploratory biomarkers in disease treatment and prognosis. We used our approach to analyze public gene expression datasets of Duchenne muscular dystrophy, a progressive inherited disease in males. Some of the regulators and biomarkers that we found have already been investigated in the context of DMD, while others have not yet been studied and may be of interest for biological and clinical studies.
The capacity of microorganisms to respond to variable external conditions requires a coordination of environment-sensing mechanisms and decision-making regulatory circuits. Here, we seek to understand the interplay between these two processes by combining high-throughput measurement of time-dependent mRNA profiles with a novel computational approach that searches for key genetic triggers of transcriptional changes. Our approach helped us understand the regulatory strategies of a respiratorily versatile bacterium with promising bioenergy and bioremediation applications, Shewanella oneidensis, in minimal and rich media. By comparing expression profiles across these two conditions, we unveiled components of the transcriptional program that depend mainly on the growth phase. Conversely, by integrating our time-dependent data with a previously available large compendium of static perturbation responses, we identified transcriptional changes that cannot be explained solely by internal network dynamics, but are rather triggered by specific genes acting as key mediators of an environment-dependent response. These transcriptional triggers include known and novel regulators that respond to carbon, nitrogen and oxygen limitation. Our analysis suggests a sequence of physiological responses, including a coupling between nitrogen depletion and glycogen storage, partially recapitulated through dynamic flux balance analysis, and experimentally confirmed by metabolite measurements. Our approach is broadly applicable to other systems.
The State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks. It requires less computational time than Dynamic Bayesian Networks (DBNs). There are two types of variables in the linear SSM: observed variables and hidden variables. SSM uses an iterative method, namely Expectation-Maximization (EM), to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly observed in experiments, and how their number is determined has a significant impact on the accuracy of network inference. In this study, we used SSM to infer gene regulatory networks (GRNs) from synthetic time series datasets, investigated Bayesian Information Criterion (BIC) and Principal Component Analysis (PCA) approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in comparison with DBN.
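The BIC-based choice of the hidden-state dimension can be sketched generically: fit the SSM once per candidate dimension, then pick the dimension that minimizes BIC = -2·logL + k·log(n). The likelihoods and parameter counts in the usage example are placeholder values for illustration, not results from the study:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # Bayesian Information Criterion: penalised model fit, lower is better
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def choose_hidden_dim(candidates, n_obs):
    """candidates maps a candidate hidden-state dimension to the
    (log-likelihood, parameter count) of the SSM fitted with that
    dimension (e.g. by EM); returns the dimension minimising BIC."""
    return min(candidates, key=lambda d: bic(*candidates[d], n_obs))
```

The trade-off is visible directly: a larger hidden dimension always improves the likelihood, but the k·log(n) penalty rejects the improvement once it no longer justifies the extra parameters.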
True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the true networks.
Our results show that inference precision varied with the number of hidden variables. For some regulatory networks, the inference precision of DBN was higher, but SSM performed better in other cases. Although the overall performance of the two approaches is comparable, SSM is much faster and capable of inferring much larger networks than DBN.
This study provides useful information in handling the hidden variables and improving the inference precision.
Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. We therefore propose to apply PCA to the genotype data of a population, select the significant principal components using the Tracy-Widom distribution, and assign individuals to one or more subpopulations using generic clustering algorithms.
We investigated K-means, soft K-means and spectral clustering, and compared them to STRUCTURE, a model-based algorithm specifically designed for population structure inference. Moreover, we investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably well to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for the real datasets, BIC is a better index than likelihood for predicting the number of subpopulations.
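The clustering stage of such a pipeline can be sketched as K-means on the significant PC scores, with a BIC comparison across candidate numbers of subpopulations. The deterministic farthest-point initialisation and the shared spherical-Gaussian BIC below are simplifying assumptions for illustration, not the exact criteria used in the study:

```python
import math

def kmeans(points, k, iters=20):
    # deterministic farthest-point initialisation, then Lloyd iterations
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(d2(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, centroids[i]))].append(p)
        centroids = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

def kmeans_bic(points, k):
    """BIC under one shared spherical Gaussian variance per clustering --
    a rough heuristic for comparing candidate numbers of subpopulations."""
    centroids, groups = kmeans(points, k)
    n, d = len(points), len(points[0])
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, c))
              for c, g in zip(centroids, groups) for p in g)
    var = max(sse / (n * d), 1e-9)
    log_l = -0.5 * n * d * (math.log(2 * math.pi * var) + 1.0)
    return -2.0 * log_l + (k * d + 1) * math.log(n)
```

On two well-separated groups of PC scores, the two-cluster model recovers the groups and scores a lower (better) BIC than the single-cluster model.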
Our approach has the advantage of being fast and scalable, while STRUCTURE is very time-consuming because of the nature of MCMC in parameter estimation. Therefore, we suggest choosing the proper algorithm based on the application of population structure inference.
Genomic experiments produce multiple views of biological systems, among them DNA sequence and copy number variation, mRNA and protein abundance. Understanding these systems requires integrated bioinformatic analysis. Public databases such as Ensembl provide relationships and mappings between the relevant sets of probe and target molecules. However, the relationships can be biologically complex and the content of the databases is dynamic. We demonstrate how to use the computational environment R to integrate and jointly analyse experimental datasets, employing BioMart web services to provide the molecule mappings. We also discuss typical problems that are encountered in making gene to transcript to protein mappings. The approach provides a flexible, programmable and reproducible basis for state-of-the-art bioinformatic data integration.
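The central mapping problem, composing one-to-many relationships between genes, transcripts and proteins, can be illustrated in a few lines. The study itself performs such mappings in R via BioMart web services; the Python sketch below uses hypothetical identifiers purely for illustration:

```python
def map_identifiers(gene_to_tx, tx_to_protein):
    """Compose two one-to-many mappings (gene -> transcripts,
    transcript -> proteins). Illustrates why such joins need care:
    a single gene may map to several proteins, or to none at all
    (e.g. a non-coding transcript)."""
    gene_to_protein = {}
    for gene, txs in gene_to_tx.items():
        proteins = {p for tx in txs for p in tx_to_protein.get(tx, [])}
        gene_to_protein[gene] = sorted(proteins)
    return gene_to_protein
```

Deduplicating via a set and keeping genes with empty protein lists are the two details that naive join-based mappings typically get wrong, inflating or silently dropping identifiers.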
Data Integration; Mapping; Identifiers; Ensembl; BioMart; Bioconductor