The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models.
Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues.
Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian non-parametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian non-parametrics. The survey covers the use of Bayesian non-parametrics for modelling unknown functions, density estimation, clustering, time-series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically, it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman’s coalescent, Dirichlet diffusion trees and Wishart processes.
probabilistic modelling; Bayesian statistics; non-parametrics; machine learning
An important milestone in revealing cells' functions is to build a comprehensive understanding of transcriptional regulation processes. These processes are largely regulated by transcription factors (TFs) binding to DNA sites. Several TF binding site (TFBS) prediction methods have been developed, but they usually model binding of a single TF at a time albeit few methods for predicting binding of multiple TFs also exist. In this article, we propose a probabilistic model that predicts binding of several TFs simultaneously. Our method explicitly models the competitive binding between TFs and uses the prior knowledge of existing protein–protein interactions (PPIs), which mimics the situation in the nucleus. Modeling DNA binding for multiple TFs improves the accuracy of binding site prediction remarkably when compared with other programs and the cases where individual binding prediction results of separate TFs have been combined. The traditional TFBS prediction methods usually predict overwhelming number of false positives. This lack of specificity is overcome remarkably with our competitive binding prediction method. In addition, previously unpredictable binding sites can be detected with the help of PPIs. Source codes are available at http://www.cs.tut.fi/∼harrila/.
Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available.
Probabilistic choice has been attracting attention in psychopharmacology and neuroeconomics. Several parametric models have been proposed for probabilistic choice; entropy model, Prelec's probability weight function, and hyperbola-like probability discounting functions.
In order to examine (i) fitness of the probabilistic models to behavioral data, (ii) relationships between the parameters and psychological processes, e.g., aversion to possible non-gain in each probabilistic choice and aversion to unpredictability, we estimated the parameters and AICc (Akaike Information Criterion with small sample correction) of the probabilistic choice models by assessing the points of subjective equality at seven probability values (95%–5%). We examined both fitness of the models parametrized by utilizing AICc, and the relationships between the model parameters and equation-free parameter of aversion to possible non-gain.
Our results have shown that (i) the goodness of fitness for group data was [Entropy model>Prelec's function>General hyperbola>Simple hyperbola]; while Prelec's function best fitted individual data, (ii) aversion to possible non-gain and aversion to unpredictability are distinct psychological processes.
Entropy and Prelec models can be utilized in psychopharmacological and neuroeconomic studies of risky decision-making.
Parameter estimation for infectious disease models is important for basic understanding (e.g. to identify major transmission pathways), for forecasting emerging epidemics, and for designing control measures. Differential equation models are often used, but statistical inference for differential equations suffers from numerical challenges and poor agreement between observational data and deterministic models. Accounting for these departures via stochastic model terms requires full specification of the probabilistic dynamics, and computationally demanding estimation methods. Here, we demonstrate the utility of an alternative approach, generalized profiling, which provides robustness to violations of a deterministic model without needing to specify a complete probabilistic model. We introduce novel means for estimating the robustness parameters and for statistical inference in this framework. The methods are applied to a model for pre-vaccination measles incidence in Ontario, and we demonstrate the statistical validity of our inference through extensive simulation. The results confirm that school term versus summer drives seasonality of transmission, but we find no effects of short school breaks and the estimated basic reproductive ratio ℛ0 greatly exceeds previous estimates. The approach applies naturally to any system for which candidate differential equations are available, and avoids many challenges that have limited Monte Carlo inference for state–space models.
differential equation model; generalized profiling; state–space model; measles
One methodology that has met success to infer gene networks from gene expression data is based upon ordinary differential equations (ODE). However new types of data continue to be produced, so it is worthwhile to investigate how to integrate these new data types into the inference procedure. One such data is physical interactions between transcription factors and the genes they regulate as measured by ChIP-chip or ChIP-seq experiments. These interactions can be incorporated into the gene network inference procedure as a priori network information. In this article, we extend the ODE methodology into a general optimization framework that incorporates existing network information in combination with regularization parameters that encourage network sparsity. We provide theoretical results proving convergence of the estimator for our method and show the corresponding probabilistic interpretation also converges. We demonstrate our method on simulated network data and show that existing network information improves performance, overcomes the lack of observations, and performs well even when some of the existing network information is incorrect. We further apply our method to the core regulatory network of embryonic stem cells utilizing predicted interactions from two studies as existing network information. We show that including the prior network information constructs a more closely representative regulatory network versus when no information is provided.
Motivation Accurate knowledge of the genome-wide binding of transcription factors in a particular cell type or under a particular condition is necessary for understanding transcriptional regulation. Using epigenetic data such as histone modification and DNase I, accessibility data has been shown to improve motif-based in silico methods for predicting such binding, but this approach has not yet been fully explored.
Results We describe a probabilistic method for combining one or more tracks of epigenetic data with a standard DNA sequence motif model to improve our ability to identify active transcription factor binding sites (TFBSs). We convert each data type into a position-specific probabilistic prior and combine these priors with a traditional probabilistic motif model to compute a log-posterior odds score. Our experiments, using histone modifications H3K4me1, H3K4me3, H3K9ac and H3K27ac, as well as DNase I sensitivity, show conclusively that the log-posterior odds score consistently outperforms a simple binary filter based on the same data. We also show that our approach performs competitively with a more complex method, CENTIPEDE, and suggest that the relative simplicity of the log-posterior odds scoring method makes it an appealing and very general method for identifying functional TFBSs on the basis of DNA and epigenetic evidence.
Availability and implementation: FIMO, part of the MEME Suite software toolkit, now supports log-posterior odds scoring using position-specific priors for motif search. A web server and source code are available at http://meme.nbcr.net. Utilities for creating priors are at http://research.imb.uq.edu.au/t.bailey/SD/Cuellar2011.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: A global map of transcription factor binding sites (TFBSs) is critical to understanding gene regulation and genome function. DNaseI digestion of chromatin coupled with massively parallel sequencing (digital genomic footprinting) enables the identification of protein-binding footprints with high resolution on a genome-wide scale. However, accurately inferring the locations of these footprints remains a challenging computational problem.
Results: We present a dynamic Bayesian network-based approach for the identification and assignment of statistical confidence estimates to protein-binding footprints from digital genomic footprinting data. The method, DBFP, allows footprints to be identified in a probabilistic framework and outperforms our previously described algorithm in terms of precision at a fixed recall. Applied to a digital footprinting data set from Saccharomyces cerevisiae, DBFP identifies 4679 statistically significant footprints within intergenic regions. These footprints are mainly located near transcription start sites and are strongly enriched for known TFBSs. Footprints containing no known motif are preferentially located proximal to other footprints, consistent with cooperative binding of these footprints. DBFP also identifies a set of statistically significant footprints in the yeast coding regions. Many of these footprints coincide with the boundaries of antisense transcripts, and the most significant footprints are enriched for binding sites of the chromatin-associated factors Abf1 and Rap1.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary material is available at Bioinformatics online.
We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms.
The DNA binding sites of transcription factors that control gene expression are often predicted based on a collection of known or selected binding sites. The most commonly used methods for inferring the binding site pattern, or sequence motif, assume that the sites are selected in proportion to their affinity for the transcription factor, ignoring the effect of the transcription factor concentration. We have developed a new maximum likelihood approach, in a program called BEEML, that directly takes into account the transcription factor concentration as well as non-specific contributions to the binding affinity, and we show in simulation studies that it gives a much more accurate model of the transcription factor binding sites than previous methods. We also develop a new method for extracting binding sites for a transcription factor from a random pool of DNA sequences, called high-throughput SELEX (HT-SELEX), and we show that after a single round of selection BEEML can obtain an accurate model of the transcription factor binding sites.
Motivation: The inference of pre-mutation immunoglobulin (Ig) rearrangements is essential in the study of the antibody repertoires produced in response to infection, in B-cell neoplasms and in autoimmune disease. Often, there are several rearrangements that are nearly equivalent as candidates for a given Ig gene, but have different consequences in an analysis. Our aim in this article is to develop a probabilistic model of the rearrangement process and a Bayesian method for estimating posterior probabilities for the comparison of multiple plausible rearrangements.
Results: We have developed SoDA2, which is based on a Hidden Markov Model and used to compute the posterior probabilities of candidate rearrangements and to find those with the highest values among them. We validated the software on a set of simulated data, a set of clonally related sequences, and a group of randomly selected Ig heavy chains from Genbank. In most tests, SoDA2 performed better than other available software for the task. Furthermore, the output format has been redesigned, in part, to facilitate comparison of multiple solutions.
Availability: SoDA2 is available online at https://hippocrates.duhs.duke.edu/soda. Simulated sequences are available upon request.
The frequency of early fatality and the transient nature of emergency medical care mean that a single database will rarely suffice for population based injury research. Linking records from multiple data sources is therefore a promising method for injury surveillance or trauma system evaluation. The purpose of this article is to review the historical development of record linkage, provide a basic mathematical foundation, discuss some practical issues, and consider some ethical concerns.
Clerical or computer assisted deterministic record linkage methods may suffice for some applications, but probabilistic methods are particularly useful for larger studies. The probabilistic method attempts to simulate human reasoning by comparing each of several elements from the two records. The basic mathematical specifications are derived algebraically from fundamental concepts of probability, although the theory can be extended to include more advanced mathematics.
Probabilistic, deterministic, and clerical techniques may be combined in different ways depending upon the goal of the record linkage project. If a population parameter is being estimated for a purely statistical study, a completely probabilistic approach may be most efficient; for other applications, where the purpose is to make inferences about specific individuals based upon their data contained in two or more files, the need for a high positive predictive value would favor a deterministic method or a probabilistic method with careful clerical review. Whatever techniques are used, researchers must realize that the combination of data sources entails additional ethical obligations beyond the use of each source alone.
The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity. It is important to quantify how comparative genomics can improve computational detection of such modules.
We run the Stubb software on the entire D. melanogaster genome, to obtain predictions of modules involved in segmentation of the embryo. Stubb uses a probabilistic model to score sequences for clustering of transcription factor binding sites, and can exploit multiple species data within the same probabilistic framework. The predictions are evaluated using publicly available gene expression data for thousands of genes, after careful manual annotation. We demonstrate that the use of a second genome (D. pseudoobscura) for cross-species comparison significantly improves the prediction accuracy of Stubb, and is a more sensitive approach than intersecting the results of separate runs over the two genomes. The entire list of predictions is made available online.
Evolutionary conservation of modules serves as a filter to improve their detection in silico. The future availability of additional fruitfly genomes therefore carries the prospect of highly specific genome-wide predictions using Stubb.
Algorithms that locate evolutionarily conserved sequences have become powerful tools for finding functional DNA elements, including transcription factor binding sites; however, most methods do not take advantage of an explicit model for the constrained evolution of functional DNA sequences.
We developed a probabilistic framework that combines an HKY85 model, which assigns probabilities to different base substitutions between species, and weight matrix models of transcription factor binding sites, which describe the probabilities of observing particular nucleotides at specific positions in the binding site. The method incorporates the phylogenies of the species under consideration and takes into account the position specific variation of transcription factor binding sites. Using our framework we assessed the suitability of alignments of genomic sequences from commonly used species as substrates for comparative genomic approaches to regulatory motif finding. We then applied this technique to Saccharomyces cerevisiae and related species by examining all possible six base pair DNA sequences (hexamers) and identifying sequences that are conserved in a significant number of promoters. By combining similar conserved hexamers we reconstructed known cis-regulatory motifs and made predictions of previously unidentified motifs. We tested one prediction experimentally, finding it to be a regulatory element involved in the transcriptional response to glucose.
The experimental validation of a regulatory element prediction missed by other large-scale motif finding studies demonstrates that our approach is a useful addition to the current suite of tools for finding regulatory motifs.
Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.
We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif' describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.
We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase the sensitivity to infer motifs. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with it. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks related to especially de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.
DNA sequence binding motifs for several important transcription factors happen to be self-overlapping. Many of the current regulatory site identification methods do not explicitly take into account the overlapping sites. Moreover, most methods use arbitrary thresholds and fail to provide a biophysical interpretation of statistical quantities. In addition, commonly used approaches do not include the location of a site with respect to the transcription start site (TSS) in an integrated probabilistic framework while identifying sites. Ignoring these features can lead to inaccurate predictions as well as incorrect design and interpretation of experimental results.
We have developed a tool based on a Hidden Markov Model (HMM) that identifies binding location of transcription factors with preference for self-overlapping DNA motifs by combining the effects of their alternative binding modes. Interpreting HMM parameters as biophysical quantities, this method uses the occupancy probability of a transcription factor on a DNA sequence as the discriminant function, earning the algorithm the name OHMM: Occupancy via Hidden Markov Model. OHMM learns the classification threshold by training emission probabilities using unaligned sequences containing known sites and estimating transition probabilities to reflect site density in all promoters in a genome. While identifying sites, it adjusts parameters to model site density changing with the distance from the transcription start site. Moreover, it provides guidance for designing padding sequences in gel shift experiments. In the context of binding sites to transcription factor NF-κB, we find that the occupancy probability predicted by OHMM correlates well with the binding affinity in gel shift experiments. High evolutionary conservation scores and enrichment in experimentally verified regulated genes suggest that NF-κB binding sites predicted by our method are likely to be functional.
Our method deals specifically with identifying locations with multiple overlapping binding sites by computing the local occupancy of the transcription factor. Moreover, considering OHMM as a biophysical model allows us to learn the classification threshold in a principled manner. Another feature of OHMM is that we allow transition probabilities to change with location relative to the TSS. OHMM could be used to predict physical occupancy, and provides guidance for proper design of gel-shift experiments. Based upon our predictions, new insights into NF-κB function and regulation and possible new biological roles of NF-κB were uncovered.
Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/.
Transcription factor (TF) protein binding to its DNA target sequences is a fundamental physical interaction underlying gene regulation. Characterizing the binding specificities of TFs is essential for deducing which genes are regulated by which TFs. Recently, several high-throughput methods that measure sequences enriched for TF targets genomewide were developed. Since TFs recognize relatively short sequences, much effort has been directed at developing computational methods that identify enriched subsequences (motifs) from these sequences. However, little effort has been directed towards improving the representation of motifs. Practically, available motif finding software use the position specific scoring matrix (PSSM) model, which assumes independence between different motif positions. We present an alternative, richer model, called the feature motif model (FMM), that enables the representation of a variety of sequence features and captures dependencies that exist between binding site positions. We show how FMMs explain TF binding data better than PSSMs on both synthetic and real data. We also present a motif finder algorithm that learns FMM motifs from unaligned promoter sequences and show how de novo FMMs, learned from binding data of the human TFs c-Myc and CTCF, reveal intriguing insights about their binding specificities.
With the explosion in data generated using microarray technology by different investigators working on similar experiments, it is of interest to combine results across multiple studies.
In this article, we describe a general probabilistic framework for combining high-throughput genomic data from several related microarray experiments using mixture models. A key feature of the model is the use of latent variables that represent quantities that can be combined across diverse platforms. We consider two methods for estimation of an index termed the probability of expression (POE). The first, reported in previous work by the authors, involves Markov Chain Monte Carlo (MCMC) techniques. The second method is a faster algorithm based on the expectation-maximization (EM) algorithm. The methods are illustrated with application to a meta-analysis of datasets for metastatic cancer.
The statistical methods described in the paper are available as an R package, metaArray 1.8.1, which is at Bioconductor, whose URL is .
Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic framework. Cluster analysis is increasingly often supplemented with multiple data sources nowadays, and these heterogeneous information sources should be made as efficient use of as possible.
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.
In this paper we introduce an efficient algorithm for alignment of multiple large-scale biological networks. In this scheme, we first compute a probabilistic similarity measure between nodes that belong to different networks using a semi-Markov random walk model. The estimated probabilities are further enhanced by incorporating the local and the cross-species network similarity information through the use of two different types of probabilistic consistency transformations. The transformed alignment probabilities are used to predict the alignment of multiple networks based on a greedy approach. We demonstrate that the proposed algorithm, called SMETANA, outperforms many state-of-the-art network alignment techniques, in terms of computational efficiency, alignment accuracy, and scalability. Our experiments show that SMETANA can easily align tens of genome-scale networks with thousands of nodes on a personal computer without any difficulty. The source code of SMETANA is available upon request. The source code of SMETANA can be downloaded from http://www.ece.tamu.edu/~bjyoon/SMETANA/.
Imaging plays an important role in characterizing tumors. Knowledge inferred from imaging data has the potential to improve disease management dramatically, but physicians lack a tool to easily interpret and manipulate the data. A probabilistic disease model, such as a Bayesian belief network, may be used to quantitatively model relationships found in the data. In this paper, a framework is presented that enables visual querying of an underlying disease model via a query-by-example paradigm. Users draw graphical metaphors to visually represent features in their query. The structure and parameters specified within the model guide the user through query formulation by determining when a user may draw a particular metaphor. Spatial and geometrical features are automatically extracted from the query diagram and used to instantiate the probabilistic model in order to answer the query. An implementation is described in the context of managing patients with brain tumors.
Gene regulatory network inference remains a challenging problem in systems biology despite the numerous approaches that have been proposed. When substantial knowledge on a gene regulatory network is already available, supervised network inference is appropriate. Such a method builds a binary classifier able to assign a class (Regulation/No regulation) to an ordered pair of genes. Once learnt, the pairwise classifier can be used to predict new regulations. In this work, we explore the framework of Markov Logic Networks (MLN) that combine features of probabilistic graphical models with the expressivity of first-order logic rules.
We propose to learn a Markov Logic network, e.g. a set of weighted rules that conclude on the predicate “regulates”, starting from a known gene regulatory network involved in the switch proliferation/differentiation of keratinocyte cells, a set of experimental transcriptomic data and various descriptions of genes all encoded into first-order logic. As training data are unbalanced, we use asymmetric bagging to learn a set of MLNs. The prediction of a new regulation can then be obtained by averaging predictions of individual MLNs. As a side contribution, we propose three in silico tests to assess the performance of any pairwise classifier in various network inference tasks on real datasets. A first test consists of measuring the average performance on balanced edge prediction problem; a second one deals with the ability of the classifier, once enhanced by asymmetric bagging, to update a given network. Finally our main result concerns a third test that measures the ability of the method to predict regulations with a new set of genes. As expected, MLN, when provided with only numerical discretized gene expression data, does not perform as well as a pairwise SVM in terms of AUPR. However, when a more complete description of gene properties is provided by heterogeneous sources, MLN achieves the same performance as a black-box model such as a pairwise SVM while providing relevant insights on the predictions.
The numerical studies show that MLN achieves very good predictive performance while opening the door to some interpretability of the decisions. Besides the ability to suggest new regulations, such an approach allows to cross-validate experimental data with existing knowledge.
In this article we address the problem of phylogenetic inference from nucleic acid data containing missing bases. We introduce a new effective approach, called “Probabilistic estimation of missing values” (PEMV), allowing one to estimate unknown nucleotides prior to computing the evolutionary distances between them. We show that the new method improves the accuracy of phylogenetic inference compared to the existing methods “Ignoring Missing Sites” (IMS), “Proportional Distribution of Missing and Ambiguous Bases” (PDMAB) included in the PAUP software . The proposed strategy for estimating missing nucleotides is based on probabilistic formulae developed in the framework of the Jukes-Cantor  and Kimura 2-parameter  models. The relative performances of the new method were assessed through simulations carried out with the SeqGen program , for data generation, and the Bio NJ method , for inferring phylogenies. We also compared the new method to the DNAML program  and “Matrix Representation using Parsimony” (MRP) ,  considering an example of 66 eutherian mammals originally analyzed in .
We present PEER (probabilistic estimation of expression residuals), a software package implementing statistical models that improve the sensitivity and interpretability of genetic associations in population-scale expression data. This approach builds on factor analysis methods that infer broad variance components in the measurements. PEER takes as input transcript profiles and covariates from a set of individuals, and then outputs hidden factors that explain much of the expression variability. Optionally, these factors can be interpreted as pathway or transcription factor activations by providing prior information about which genes are involved in the pathway or targeted by the factor. The inferred factors are used in genetic association analyses. First, they are treated as additional covariates, and are included in the model to increase detection power for mapping expression traits. Second, they are analyzed as phenotypes themselves to understand the causes of global expression variability. PEER extends previous related surrogate variable models and can be implemented within hours on a desktop computer.
Molecular mechanisms employed by individual multipotent cells at the point of lineage commitment remain largely uncharacterized. Current paradigms span from instructive to noise-driven mechanisms. Of considerable interest is also whether commitment involves a limited set of genes or the entire transcriptional program, and to what extent gene expression configures multiple trajectories into commitment. Importantly, the transient nature of the commitment transition confounds the experimental capture of committing cells. We develop a computational framework that simulates stochastic commitment events, and affords mechanistic exploration of the fate transition. We use a combined modeling approach guided by gene expression classifier methods that infers a time-series of stochastic commitment events from experimental growth characteristics and gene expression profiling of individual hematopoietic cells captured immediately before and after commitment. We define putative regulators of commitment and probabilistic rules of transition through machine learning methods, and employ clustering and correlation analyses to interrogate gene regulatory interactions in multipotent cells. Against this background, we develop a Monte Carlo time-series stochastic model of transcription where the parameters governing promoter status, mRNA production and mRNA decay in multipotent cells are fitted to experimental static gene expression distributions. Monte Carlo time is converted to physical time using cell culture kinetic data. Probability of commitment in time is a function of gene expression as defined by a logistic regression model obtained from experimental single-cell expression data. Our approach should be applicable to similar differentiating systems where single cell data is available. Within our system, we identify robust model solutions for the multipotent population within physiologically reasonable values and explore model predictions with regard to molecular scenarios of entry into commitment. The model suggests distinct dependencies of different commitment-associated genes on mRNA dynamics and promoter activity, which globally influence the probability of lineage commitment.
Stem cells have the capacity to both self-renew and differentiate into specialized cell lineages, thus sustaining tissue formation during embryonic development and permitting tissue homeostasis throughout adult life. Previous studies have suggested that stem cell commitment to a specific lineage may constitute a discrete event of stochastic activation of a small number of key regulator genes. Experimental exploration of this question is challenging, in face of the elusive nature of the commitment transition and due to considerable gene expression heterogeneity between cells. Here, we implement a computational model that simulates gene expression variation through time and affords the capture of in silico commitment events. This model integrates statistical analysis of experimental single-cell gene expression data with dynamical modeling methods to implement a mechanistic framework for stochastic regulation of gene transcription and a probabilistic approach for the commitment rules. Applied to blood cells, our method identifies potential commitment-associated genes, explores how their expression patterns can define alternative commitment regimes, and suggests how differences in regulation of gene expression dynamics can impact the frequency of commitment.