Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions.
Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading class performance over an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA).
Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code are available under an open source license at http://bit.ly/1gimnss.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0401-3) contains supplementary material, which is available to authorized users.
Classification; RNA-Seq; Model-based; Bayesian
There are two distinct issues regarding network validation: (1) Does an inferred network provide good predictions relative to experimental data? (2) Does a network inference algorithm applied within a certain network model framework yield networks that are accurate relative to some criterion of goodness? The first issue concerns scientific validation and the second concerns algorithm validation. In this paper we consider inferential validation relative to controllability; that is, if an inference procedure is applied to data generated from a gene regulatory network and an intervention procedure is designed on the inferred network, how well does it perform on the true network? The reasoning behind such a criterion is that, if our purpose is to use gene regulatory networks to design therapeutic intervention strategies, then we are not concerned with network fidelity, per se, but only with our ability to design effective interventions based on the inferred network. We will consider the problem from the perspective of stationary control, which involves designing a control policy to be applied over time based on the current state of the network, with the decision procedure itself being time independent. The objective of a control policy is to optimally reduce the total steady-state probability mass of the undesirable states (phenotypes), which is equivalent to optimally increasing the total steady-state mass of the desirable states. Based on this criterion we compare several proposed network inference procedures. We will see that inference procedure ψ may perform worse than inference procedure ξ relative to inferring the full network structure but perform better than ξ relative to controllability. Hence, when one is aiming at a specific application, it may be wise to use an objective-based measure of inference validity.
network inference; genetic regulatory network; control; validation; probabilistic Boolean network
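The controllability criterion described above can be illustrated with a minimal sketch. It assumes a tiny, strictly positive Markov chain with two actions per state, an exhaustive search over stationary policies, and hypothetical transition matrices named `TRUE` and `INFERRED`; none of these come from the paper, which works with gene regulatory network models. The policy is designed on the inferred chain and then scored on the true chain by the steady-state mass of the undesirable states.

```python
from itertools import product

def steady_state(P, iters=10_000, tol=1e-12):
    """Steady-state distribution of an ergodic chain by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def undesirable_mass(model, policy, undesirable):
    """Steady-state mass of the undesirable states under a stationary policy.

    model[a][s] is the transition row out of state s when action a is applied.
    """
    P = [model[policy[s]][s] for s in range(len(policy))]
    pi = steady_state(P)
    return sum(pi[s] for s in undesirable)

def best_policy(model, undesirable):
    """Exhaustive search over stationary policies (only viable for tiny chains)."""
    n_states = len(model[0])
    return min(product(range(len(model)), repeat=n_states),
               key=lambda pol: undesirable_mass(model, pol, undesirable))

# Hypothetical 3-state chains; state 2 plays the role of the undesirable phenotype.
TRUE = [
    [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],  # action 0: no intervention
    [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.4, 0.4, 0.2]],  # action 1: intervene
]
INFERRED = [
    [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
    [[0.5, 0.2, 0.3], [0.4, 0.5, 0.1], [0.3, 0.5, 0.2]],
]
UNDESIRABLE = {2}

# Design on the inferred network, evaluate on the true network.
designed = best_policy(INFERRED, UNDESIRABLE)
mass_designed = undesirable_mass(TRUE, designed, UNDESIRABLE)
mass_optimal = undesirable_mass(TRUE, best_policy(TRUE, UNDESIRABLE), UNDESIRABLE)
print(designed, round(mass_designed, 4), round(mass_optimal, 4))
```

The gap between `mass_designed` and `mass_optimal` is one way to express how much control performance is lost by designing on the inferred network, independently of how closely its structure matches the true one.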
Restricted Boolean networks are simplified Boolean networks in which each regulatory relationship between genes is required to be either negative or positive. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback, namely, it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm based on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance μhame, the normalized Hamming distance of state transition μhamst, and the steady-state distribution distance μssd. Results show that the proposed algorithm outperforms the others according to both μhame and μhamst, whereas its performance according to μssd is intermediate between best-fit and the three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.
Restricted Boolean network; Inference; Budding yeast cell cycle
Perfect knowledge of the underlying state transition probabilities is necessary for designing an optimal intervention strategy for a given Markovian genetic regulatory network. However, in many practical situations, the complex nature of the network and/or identification costs limit the availability of such perfect knowledge. To address this difficulty, we propose to take a Bayesian approach and represent the system of interest as an uncertainty class of several models, each assigned some probability, which reflects our prior knowledge about the system. We define the objective function to be the expected cost relative to the probability distribution over the uncertainty class and formulate an optimal Bayesian robust intervention policy minimizing this cost function. The resulting policy may not be optimal for a fixed element within the uncertainty class, but it is optimal when averaged across the uncertainty class. Furthermore, starting from a prior probability distribution over the uncertainty class and collecting samples from the process over time, one can update the prior distribution to a posterior and find the corresponding optimal Bayesian robust policy relative to the posterior distribution. Therefore, the optimal intervention policy is essentially nonstationary and adaptive.
Optimal intervention; Markovian gene regulatory networks; Probabilistic Boolean networks; Uncertainty; Prior knowledge; Bayesian control
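The averaging argument above can be made concrete with a small sketch. It assumes a finite uncertainty class of two models, a finite set of three candidate policies, and a hypothetical table `COSTS[model][policy]` of long-run costs; all numbers are illustrative and not taken from the paper. The robust policy minimizes the expected cost under the current distribution over models, and a Bayes update of that distribution can change which policy is selected.

```python
def expected_cost(costs, weights, policy):
    """Cost of a policy averaged over the uncertainty class."""
    return sum(w * model_costs[policy] for w, model_costs in zip(weights, costs))

def robust_policy(costs, weights):
    """Index of the policy with minimum expected cost."""
    return min(range(len(costs[0])), key=lambda p: expected_cost(costs, weights, p))

def posterior(weights, likelihoods):
    """Bayes update of the model probabilities given data likelihoods."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical costs: rows are models in the uncertainty class, columns are policies.
COSTS = [
    [0.10, 0.30, 0.18],  # model 0: policy 0 is best
    [0.40, 0.12, 0.20],  # model 1: policy 1 is best
]
prior = [0.5, 0.5]

chosen_prior = robust_policy(COSTS, prior)       # best on average under the prior
updated = posterior(prior, likelihoods=[0.9, 0.1])  # observed data favour model 0
chosen_posterior = robust_policy(COSTS, updated)
print(chosen_prior, updated, chosen_posterior)
```

Under the prior the selected policy is optimal for neither individual model, which matches the remark that the robust policy need not be optimal for a fixed element of the class; after the update the selection moves toward the policy that suits the favoured model.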
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance.
gene set enrichment analysis; feature ranking; data model; simulation study
Hannah Arendt, one of the foremost political philosophers of the twentieth century, has argued that it is the responsibility of educators not to leave children in their own world but instead to bring them into the adult world so that, as adults, they can carry civilization forward to whatever challenges it will face by bringing to bear the learning of the past. In the same collection of essays, she discusses the recognition by modern science that Nature is inconceivable in terms of ordinary human conceptual categories - as she writes, ‘unthinkable in terms of pure reason’. Together, these views on scientific education lead to an educational process that transforms children into adults, with a scientific adult being one who has the ability to conceptualize scientific systems independent of ordinary physical intuition. This article begins with Arendt’s basic educational and scientific points and develops from them a critique of current scientific education in conjunction with an appeal to educate young scientists in a manner that allows them to fulfill their potential ‘on the shoulders of giants’. While the article takes a general philosophical perspective, its specifics tend to be directed at biomedical education, in particular, how such education pertains to translational science.
Motivation: A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in allocating substantially more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study?
Results: This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed.
Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/. Supplementary simulation results are also included.
Supplementary data are available at Bioinformatics online.
This paper provides exact analytical expressions for the first and second moments of the true error for linear discriminant analysis (LDA) when the data are univariate and taken from two stochastic Gaussian processes. The key point is that we assume a general setting in which the sample data from each class do not need to be identically distributed or independent within or between classes. We compare the true errors of designed classifiers under the typical i.i.d. model and when the data are correlated, providing exact expressions and demonstrating that, depending on the covariance structure, correlated data can result in classifiers with either greater error or less error than when training with uncorrelated data. The general theory is applied to autoregressive and moving-average models of the first order, and it is demonstrated using real genomic data.
Linear discriminant analysis; Stochastic settings; Correlated data; Non-i.i.d data; Expected error; Gaussian processes; Auto-regressive models; Moving-average models
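The comparison between i.i.d. and correlated training data can be explored numerically. The sketch below is a Monte Carlo approximation, not the exact analytical expressions of the paper. It assumes equal priors, unit-variance classes with means 0 and 1, training samples drawn from a stationary Gaussian AR(1) process within each class, and independence between the two classes; the function names and parameter values are illustrative.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf

def true_error(m0, m1, mu0=0.0, mu1=1.0, sigma=1.0):
    """True error of the univariate LDA threshold rule with equal priors.

    m0 and m1 are the sample means; the threshold is their midpoint and a
    point is assigned to the class whose sample mean lies on its side.
    """
    t = 0.5 * (m0 + m1)
    z0, z1 = PHI((t - mu0) / sigma), PHI((t - mu1) / sigma)
    if m1 >= m0:                         # class 1 is assigned for x > t
        return 0.5 * (1.0 - z0) + 0.5 * z1
    return 0.5 * z0 + 0.5 * (1.0 - z1)   # orientation flipped by the sample

def ar1_sample(n, mean, sigma, phi, rng):
    """Stationary Gaussian AR(1) sample with marginal N(mean, sigma^2)."""
    x = rng.gauss(0.0, 1.0)
    out = [mean + sigma * x]
    for _ in range(n - 1):
        x = phi * x + (1.0 - phi * phi) ** 0.5 * rng.gauss(0.0, 1.0)
        out.append(mean + sigma * x)
    return out

def error_moments(n, phi, reps=5000, seed=0):
    """Monte Carlo estimates of the first and second moments of the true error."""
    rng = random.Random(seed)
    s1 = s2 = 0.0
    for _ in range(reps):
        m0 = sum(ar1_sample(n, 0.0, 1.0, phi, rng)) / n
        m1 = sum(ar1_sample(n, 1.0, 1.0, phi, rng)) / n
        e = true_error(m0, m1)
        s1 += e
        s2 += e * e
    return s1 / reps, s2 / reps

iid = error_moments(n=10, phi=0.0)    # phi = 0 recovers the i.i.d. model
corr = error_moments(n=10, phi=0.6)   # positively correlated training data
print(iid, corr)
```

Varying `phi`, including negative values, gives a quick feel for how the covariance structure of the training data can move the expected error in either direction relative to the i.i.d. case.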
A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than would be the actual expression levels themselves, were they directly observable.
Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered and NGS-based classification is compared to SAGE-based classification and classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification.
Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.
A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from these data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims.
Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
Boolean networks and, more generally, probabilistic Boolean networks, as one class of gene regulatory networks, model biological processes with the network dynamics determined by the logic-rule regulatory functions in conjunction with probabilistic parameters involved in network transitions. While there has been significant research on applying different control policies to alter network dynamics as future gene therapeutic intervention, we have seen less work on understanding the sensitivity of network dynamics with respect to perturbations to networks, including regulatory rules and the involved parameters, which is particularly critical for the design of intervention strategies. This paper studies this less investigated issue of network sensitivity in the long run. As the underlying model of probabilistic Boolean networks is a finite Markov chain, we define the network sensitivity based on the steady-state distributions of probabilistic Boolean networks and call it long-run sensitivity. The steady-state distribution reflects the long-run behavior of the network and it can give insight into the dynamics or momentum existing in a system. The change of steady-state distribution caused by possible perturbations is the key measure for intervention. This newly defined long-run sensitivity can provide insight on both network inference and intervention. We show results for probabilistic Boolean networks generated from random Boolean networks, and results from two real biological networks illustrate preliminary applications of sensitivity in intervention for practical problems.
Genetic regulatory networks; Boolean networks; probabilistic Boolean networks; Markov chains; sensitivity; steady-state distribution; intervention; metastasis
Scientific knowledge is grounded in a particular epistemology and, owing to the requirements of that epistemology, possesses limitations. Some limitations are intrinsic, in the sense that they depend inherently on the nature of scientific knowledge; others are contingent, depending on the present state of knowledge, including technology. Understanding limitations facilitates scientific research because one can then recognize when one is confronted by a limitation, as opposed to simply being unable to solve a problem within the existing bounds of possibility. In the hope that the role of limiting factors can be brought more clearly into focus and discussed, we consider several sources of limitation as they apply to biological knowledge: mathematical complexity, experimental constraints, validation, knowledge discovery, and human intellectual capacity.
Complexity; Gene regulatory networks; Epistemology; Experimental design; Genomics; Knowledge discovery; Modeling; Validation.
Time-course expression profiles and methods for spectrum analysis have been applied for detecting transcriptional periodicities, which are valuable patterns to unravel genes associated with cell cycle and circadian rhythm regulation. However, most of the proposed methods suffer, to some extent, from restrictions and high false-positive rates. Additionally, in some experiments, arbitrarily irregular sampling times as well as the presence of high noise and small sample sizes make accurate detection a challenging task. A novel scheme for detecting periodicities in time-course expression data is proposed, in which a real-valued iterative adaptive approach (RIAA), originally proposed for signal processing, is applied for periodogram estimation. The inferred spectrum is then analyzed using Fisher's hypothesis test. With a proper p-value threshold, periodic genes can be detected. A periodic signal, two nonperiodic signals, and four sampling strategies were considered in the simulations, including both bursts and drops. In addition, two real yeast datasets were used for validation. The simulations and real data analysis reveal that RIAA can perform competitively with the existing algorithms. The advantage of RIAA is manifested when the expression data are highly irregularly sampled, and when the number of cycles covered by the sampling time points is very small.
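The testing step can be sketched as follows. This is not RIAA: as a stand-in, the sketch uses the classical DFT periodogram on evenly sampled data, which is an assumption made only to keep the example short, and then applies Fisher's g test with its exact p-value under Gaussian white noise. The signals, noise levels, and seed are hypothetical.

```python
import cmath
import math
import random

def periodogram(x):
    """Classical periodogram at the Fourier frequencies k = 1 .. floor((N-1)/2)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    out = []
    for k in range(1, (n - 1) // 2 + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centered))
        out.append(abs(s) ** 2 / n)
    return out

def fisher_g_test(spectrum):
    """Fisher's g statistic and its exact p-value under Gaussian white noise."""
    m = len(spectrum)
    g = max(spectrum) / sum(spectrum)
    p = 0.0
    for j in range(1, min(m, int(1.0 / g)) + 1):
        base = max(0.0, 1.0 - j * g)     # guard against rounding below zero
        p += (-1) ** (j - 1) * math.comb(m, j) * base ** (m - 1)
    return g, min(1.0, max(0.0, p))

rng = random.Random(1)
N = 48
# Hypothetical profiles: four full cycles plus noise, and pure noise.
periodic = [math.sin(2 * math.pi * 4 * t / N) + rng.gauss(0.0, 0.3) for t in range(N)]
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]

spec_periodic = periodogram(periodic)
g_per, p_per = fisher_g_test(spec_periodic)
g_noise, p_noise = fisher_g_test(periodogram(noise))
print(p_per, p_noise)
```

A gene would be called periodic when its p-value falls below the chosen threshold; with irregular sampling the periodogram step is where a method such as RIAA would replace the DFT used here.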
Motivation: Peptide detection is a crucial step in mass spectrometry (MS) based proteomics. Most existing algorithms are based upon greedy isotope template matching and thus may be prone to error propagation and ineffective at detecting overlapping peptides. In addition, existing algorithms usually work on each charge state separately, ignoring useful information that can be drawn from other charge states, which may lead to poor detection of low abundance peptides.
Results: BPDA2d models spectra as a mixture of candidate peptide signals and systematically evaluates all possible combinations of peptide candidates to interpret the given spectra. For each candidate, BPDA2d takes into account its elution profile, charge state distribution and isotope pattern, and it combines all evidence to infer the candidate's signal and existence probability. By piecing all evidence together, especially by deriving information across charge states, low abundance peptides can be better identified and peptide detection rates can be improved. Instead of local template matching, BPDA2d performs global optimization for all candidates and systematically optimizes their signals. Since BPDA2d looks for the optimal among all possible interpretations of the given spectra, it is capable of handling complex spectra where features overlap. BPDA2d estimates the posterior existence probability of detected peptides, which can be directly used for probability-based evaluation in subsequent processing steps. Our experiments indicate that BPDA2d outperforms state-of-the-art detection methods on both simulated data and real liquid chromatography–mass spectrometry data, according to sensitivity and detection accuracy.
Availability: The BPDA2d software package is available at http://gsp.tamu.edu/Publications/supplementary/sun11a/
Supplementary data are available at Bioinformatics online.
Drug discovery today is a complex, expensive, and time-consuming process with a high attrition rate. A more systematic approach is needed to combine innovative approaches in order to lead to more effective and efficient drug development. This article provides systematic mathematical analysis and dynamical modeling of drug effect under gene regulatory network contexts. A hybrid systems model, which merges together discrete and continuous dynamics into a single dynamical model, is proposed to study dynamics of the underlying regulatory network under drug perturbations. The major goal is to understand how the system changes when perturbed by drugs and give suggestions for better therapeutic interventions. A realistic periodic drug intake scenario is considered, with drug pharmacokinetics and pharmacodynamics information taken into account in the proposed hybrid systems model. Simulations are performed using MATLAB/SIMULINK to corroborate the analytical results.
Drug effect; Hybrid systems; PK/PD; Gene regulatory network (GRN); Dosing regimens
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering
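The role of the kernel variance can be seen in a one-dimensional sketch of bolstered resubstitution. It assumes a fixed threshold classifier, a Gaussian bolstering kernel with a single standard deviation `kernel_std` shared by all points, and a hypothetical six-point training set. It does not implement the optimal-variance computation of the article; it only shows how the estimate depends on the variance setting.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

def bolstered_resub(points, labels, threshold, kernel_std):
    """Bolstered resubstitution error of the rule 'class 1 if x > threshold'.

    Each training point is replaced by a Gaussian kernel N(x_i, kernel_std^2);
    its contribution is the kernel mass that falls on the wrong side.
    """
    total = 0.0
    for x, y in zip(points, labels):
        mass_above = 1.0 - PHI((threshold - x) / kernel_std)
        total += mass_above if y == 0 else 1.0 - mass_above
    return total / len(points)

# Hypothetical training data, all correctly classified by the threshold 1.0.
POINTS = [0.1, 0.4, 0.9, 1.1, 1.6, 2.0]
LABELS = [0, 0, 0, 1, 1, 1]

near_resub = bolstered_resub(POINTS, LABELS, 1.0, kernel_std=1e-9)
narrow = bolstered_resub(POINTS, LABELS, 1.0, kernel_std=0.2)
wide = bolstered_resub(POINTS, LABELS, 1.0, kernel_std=0.5)
print(near_resub, narrow, wide)
```

As the kernel shrinks the estimate collapses to plain resubstitution, and as it widens the estimate grows, so the variance setting directly controls the bias of the estimator.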
For science, theoretical or applied, to significantly advance, researchers must use the most appropriate mathematical methods. A century and a half elapsed between Newton’s development of the calculus and Laplace’s development of celestial mechanics. One cannot imagine the latter without the former. Today, more than three-quarters of a century has elapsed since the birth of stochastic systems theory. This article provides a perspective on the utilization of systems theory as the proper vehicle for the development of systems biology and its application to complex regulatory diseases such as cancer.
cancer; control; epistemology; systems biology
Despite an initial response to adjuvant chemotherapy, ovarian cancer patients treated with the combination of paclitaxel and carboplatin frequently suffer from recurrence after a few cycles of treatment, and the underlying mechanisms causing the chemoresistance remain unclear. Recently, The Cancer Genome Atlas (TCGA) research network concluded an ovarian cancer study and released the dataset to the public. The TCGA dataset possesses a large sample size, comprehensive molecular profiles, and clinical outcome information; however, because of the unknown molecular subtypes in ovarian cancer and the great diversity of adjuvant treatments TCGA patients went through, studying chemotherapeutic response using the TCGA data is difficult. Additionally, factors such as sample batches, patient ages, and tumor stages further confound or suppress the identification of relevant genes, and thus the biological functions and disease mechanisms.
To address these issues, herein we propose an analysis procedure designed to reduce suppression effect by focusing on a specific chemotherapeutic treatment, and to remove confounding effects such as batch effect, patient's age, and tumor stages. The proposed procedure starts with a batch effect adjustment, followed by a rigorous sample selection process. Then, the gene expression, copy number, and methylation profiles from the TCGA ovarian cancer dataset are analyzed using a semi-supervised clustering method combined with a novel scoring function. As a result, two molecular classifications, one with poor copy number profiles and one with poor methylation profiles, enriched with unfavorable scores are identified. Compared with the samples enriched with favorable scores, these two classifications exhibit poor progression-free survival (PFS) and might be associated with poor chemotherapy response specifically to the combination of paclitaxel and carboplatin. Significant genes and biological processes are detected subsequently using classical statistical approaches and enrichment analysis.
The proposed procedure for the reduction of confounding and suppression effects and the semi-supervised clustering method are essential steps to identify genes associated with the chemotherapeutic response.
Mass spectrometry is a complex technique used for large-scale protein profiling with clinical and pharmaceutical applications. While individual components in the system have been studied extensively, little work has been done to integrate various modules and evaluate them from a systems point of view.
In this work, we investigate this problem by putting together the different modules in a typical proteomics work flow, in order to capture and analyze key factors that impact the number of identified peptides and quantified proteins, protein quantification error, differential expression results, and classification performance. The proposed proteomics pipeline model can be used to optimize the work flow as well as to pinpoint critical bottlenecks worth investing time and resources into for improving performance. Using the model-based approach proposed here, one can study systematically the critical problem of proteomic biomarker discovery, by means of simulation using ground-truthed synthetic MS data.
Molecularly targeted agents (MTAs) are increasingly used for cancer treatment, the goal being to improve the efficacy and selectivity of cancer treatment by developing agents that block the growth of cancer cells by interfering with specific targeted molecules needed for carcinogenesis and tumor growth. This approach differs from traditional cytotoxic anticancer drugs. The lack of specificity of cytotoxic drugs allows a relatively straightforward approach in preclinical and clinical studies, where the optimal dose has usually been defined as the "maximum tolerated dose" (MTD). This toxicity-based dosing approach is founded on the assumption that the therapeutic anticancer effect and toxic effects of the drug increase in parallel as the dose is escalated. On the contrary, most MTAs are expected to be more selective and less toxic than cytotoxic drugs. Consequently, the maximum therapeutic effect may be achieved at a "biologically effective dose" (BED) well below the MTD. Hence, dosing studies for MTAs should differ from those for cytotoxic drugs. Enhanced efforts to molecularly characterize the drug efficacy for MTAs in preclinical models will be valuable for successfully designing dosing regimens for clinical trials.
A novel preclinical model combining experimental methods and theoretical analysis is proposed to investigate the mechanism of action and identify pharmacodynamic characteristics of the drug. Instead of fixed time point analysis of the drug exposure to drug effect, the time course of drug effect for different doses is quantitatively studied on cell line-based platforms using system identification, where tumor cells' responses to drugs through the use of fluorescent reporters are sampled over a time course. Results show that drug effect is time-varying and higher dosages induce faster and stronger responses as expected. However, the drug efficacy change along different dosages is not linear; on the contrary, there exist certain thresholds. This kind of preclinical study can provide valuable suggestions about dosing regimens for the in vivo experimental stage to increase productivity.
Motivation: There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule.
Results: This article provides a careful probabilistic analysis of the second issue and the ‘multiple-rule bias’, resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators.
Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included.
Supplementary Information: Supplementary data are available at Bioinformatics online.
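The multiple-rule bias can be reproduced in a toy simulation. The sketch assumes that every classification rule has the same true error and that each estimate is an independent hold-out count on `n_test` points. This is a deliberate simplification, since estimates computed on a single dataset, as in the article, are dependent; the function name and parameter values are hypothetical.

```python
import random

def min_estimate_bias(n_rules, n_test, true_error, reps=2000, seed=0):
    """Average of (minimum estimated error - true error) over repeated trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        estimates = [
            sum(rng.random() < true_error for _ in range(n_test)) / n_test
            for _ in range(n_rules)
        ]
        total += min(estimates)   # report only the best-looking rule
    return total / reps - true_error

bias_single = min_estimate_bias(n_rules=1, n_test=30, true_error=0.3)
bias_multi = min_estimate_bias(n_rules=5, n_test=30, true_error=0.3)
print(bias_single, bias_multi)
```

With a single rule the estimate is unbiased up to simulation noise, while reporting the minimum over several rules is optimistic even though no rule is actually better than the others.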
Motivated by the frustration of translation of research advances in the molecular and cellular biology of cancer into treatment, this study calls for cross-disciplinary efforts and proposes a methodology of incorporating drug pharmacology information into drug therapeutic response modeling using a computational systems biology approach. The objectives are twofold. The first one is to involve effective mathematical modeling in the drug development stage to incorporate preclinical and clinical data in order to decrease costs of drug development and increase pipeline productivity, since it is extremely expensive and difficult to get the optimal compromise of dosage and schedule through empirical testing. The second objective is to provide valuable suggestions to adjust individual drug dosing regimens to improve therapeutic effects, considering that most anticancer agents have wide inter-individual pharmacokinetic variability and a narrow therapeutic index. A dynamic hybrid systems model is proposed to study drug antitumor effect from the perspective of tumor growth dynamics, specifically the dosing and schedule of the periodic drug intake, and a drug’s pharmacokinetics and pharmacodynamics information are linked together in the proposed model using a state-space approach. It is proved analytically that there exists an optimal drug dosage and interval administration point, and this is demonstrated through a simulation study.
drug effect; drug efficacy region; dosing regimens; hybrid systems; systems biology; tumor growth
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
Classification; epistemology; error estimation; genomics; validation.
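For reference, the root-mean-square (RMS) accuracy measure mentioned in the abstract above can be written as follows. The notation is an assumption and does not come from the source: \varepsilon denotes the true error of the designed classifier, \hat{\varepsilon} its estimate, and the expectation is taken over random samples from the feature-label distribution.

```latex
\[
\mathrm{RMS}(\hat{\varepsilon})
  = \sqrt{E\!\left[(\hat{\varepsilon} - \varepsilon)^{2}\right]}
  = \sqrt{\bigl(E[\hat{\varepsilon} - \varepsilon]\bigr)^{2}
          + \mathrm{Var}(\hat{\varepsilon} - \varepsilon)}
\]
```

The second equality is the usual decomposition of a second moment into squared bias plus the variance of the deviation \hat{\varepsilon} - \varepsilon. Evaluating or bounding either term requires knowledge of the sampling distribution, which is why distributional assumptions are needed.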
Motivation: A key goal of studying biological systems is to design therapeutic intervention strategies. Probabilistic Boolean networks (PBNs) constitute a mathematical model which enables modeling, predicting and intervening in their long-run behavior using Markov chain theory. The long-run dynamics of a PBN, as represented by its steady-state distribution (SSD), can guide the design of effective intervention strategies for the modeled systems. A major obstacle for its application is the large state space of the underlying Markov chain, which poses a serious computational challenge. Hence, it is critical to reduce the model complexity of PBNs for practical applications.
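To make the role of the steady-state distribution concrete, the sketch below builds the transition matrix of a toy two-gene PBN and computes its SSD by power iteration. It is only an illustration under stated assumptions: the two constituent networks, their selection probabilities and the perturbation probability `p` are hypothetical, and the model used is the standard instantaneously random PBN with independent gene perturbation, which may differ from the exact setup in the paper.

```python
def transition_matrix(tables, selection_probs, n_genes, p):
    """Transition matrix of a PBN with random gene perturbation.

    tables[k][x] is the next state of constituent network k from state x
    (states are integers whose bits are gene values). With probability
    (1 - p)**n_genes no gene is perturbed and a network is chosen according
    to selection_probs; otherwise the perturbed genes are flipped.
    """
    n_states = 2 ** n_genes
    P = [[0.0] * n_states for _ in range(n_states)]
    for x in range(n_states):
        # No perturbation: follow one of the constituent Boolean networks.
        for table, c in zip(tables, selection_probs):
            P[x][table[x]] += c * (1 - p) ** n_genes
        # Perturbation: gamma marks which genes flip (at least one).
        for gamma in range(1, n_states):
            flips = bin(gamma).count("1")
            P[x][x ^ gamma] += p ** flips * (1 - p) ** (n_genes - flips)
    return P


def steady_state(P, tol=1e-12, max_iter=100000):
    """Power iteration pi <- pi P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


if __name__ == "__main__":
    # Hypothetical two-gene example: 4 states, two constituent networks.
    tables = [[0, 2, 1, 3], [0, 0, 3, 3]]
    P = transition_matrix(tables, [0.5, 0.5], n_genes=2, p=0.05)
    pi = steady_state(P)
    undesirable = [3]  # hypothetical undesirable state
    print("steady-state distribution:", [round(v, 4) for v in pi])
    print("undesirable mass:", round(sum(pi[s] for s in undesirable), 4))
```

The matrix has 2^n rows and columns for n genes, so this direct approach becomes impractical as n grows, which is the computational obstacle the paragraph above refers to.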
Results: We propose a strategy to reduce the state space of the underlying Markov chain of a PBN based on the criterion that the reduction least distorts the proportional change of stationary masses for critical states, for instance, the network attractors. In contrast to previous reduction methods, we reduce the state space directly, without deleting genes. We then derive stationary control policies on the reduced network that can be naturally induced back to the original network. Computational experiments study the effects of the reduction on model complexity and on the performance of the designed control policies, which is measured by the shift of stationary mass away from undesirable states, those associated with undesirable phenotypes. We consider randomly generated networks as well as a 17-gene gastrointestinal cancer network, which, if not reduced, has a 2^17 × 2^17 transition probability matrix. Such a dimension is too large for direct application of many previously proposed PBN intervention strategies.
Supplementary information: Supplementary information is available at Bioinformatics online.