The current interest in big data, machine learning and data analytics has generated the widespread impression that such methods are capable of solving most problems without the need for conventional scientific methods of inquiry. Interest in these methods is intensifying, accelerated by the ease with which digitized data can be acquired in virtually all fields of endeavour, from science, healthcare and cybersecurity to economics, social sciences and the humanities. In multiscale modelling, machine learning appears to provide a shortcut to reveal correlations of arbitrary complexity between processes at the atomic, molecular, meso- and macroscales. Here, we point out the weaknesses of pure big data approaches with particular focus on biology and medicine, which fail to provide conceptual accounts for the processes to which they are applied. No matter their ‘depth’ and the sophistication of data-driven methods, such as artificial neural nets, in the end they merely fit curves to existing data. Not only do these methods invariably require far larger quantities of data than anticipated by big data aficionados in order to produce statistically reliable results, but they can also fail in circumstances beyond the range of the data used to train them because they are not designed to model the structural characteristics of the underlying system. We argue that it is vital to use theory as a guide to experimental design for maximal efficiency of data collection and to produce reliable predictive models and conceptual knowledge. Rather than continuing to fund, pursue and promote ‘blind’ big data projects with massive budgets, we call for more funding to be allocated to the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare.
This article is part of the themed issue ‘Multiscale modelling at the physics–chemistry–biology interface’.
machine learning; big data; personalized medicine; biomedicine; epistemology
The landscape of translational research has been shifting toward drug combination therapies. Pairing of drugs allows for more types of drug interaction with cells. In order to accurately and comprehensively assess combinational drug efficacy, analytical methods capable of recognizing these alternative reactions will be required to prioritize those drug candidates having better chances of delivering appreciable therapeutic benefits. Traditional efficacy measures are primarily based on the “extent” of drug inhibition, which is the percentage of cells being killed after drug exposure. Here, we introduce a second dimension of evaluation criterion, speed of killing, based on a live cell imaging assay. This dynamic response trajectory approach takes advantage of both “extent” and “speed” information and uncovers synergisms that would otherwise be missed, while also generating hypotheses regarding important mechanistic modes of drug action.
combinational drug; synergism; drug response; dynamics; cell imaging
The most important aspect of any classifier is its error rate, because this quantifies its predictive capacity. Thus, the accuracy of error estimation is critical. Error estimation is problematic in small-sample classifier design because the error must be estimated using the same data from which the classifier has been designed. Use of prior knowledge, in the form of a prior distribution on an uncertainty class of feature-label distributions to which the true, but unknown, feature-label distribution belongs, can facilitate accurate error estimation (in the mean-square sense) in circumstances where accurate completely model-free error estimation is impossible. This paper provides analytic asymptotically exact finite-sample approximations for various performance metrics of the resulting Bayesian Minimum Mean-Square-Error (MMSE) error estimator in the case of linear discriminant analysis (LDA) in the multivariate Gaussian model. These performance metrics include the first, second, and cross moments of the Bayesian MMSE error estimator with the true error of LDA, and therefore, the Root-Mean-Square (RMS) error of the estimator. We lay down the theoretical groundwork for Kolmogorov double-asymptotics in a Bayesian setting, which enables us to derive asymptotic expressions of the desired performance metrics. From these we produce analytic finite-sample approximations and demonstrate their accuracy via numerical examples. Various examples illustrate the behavior of these approximations and their use in determining the necessary sample size to achieve a desired RMS. The Supplementary Material contains derivations for some equations and added figures.
Linear discriminant analysis; Bayesian Minimum Mean-Square Error Estimator; Double asymptotics; Kolmogorov asymptotics; Performance metrics; RMS
In classification, prior knowledge is incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution. A posterior distribution is then derived from the prior and the sample data. An optimal Bayesian classifier (OBC) minimizes the expected misclassification error relative to the posterior distribution. From an application perspective, prior construction is critical. The prior distribution is formed by mapping a set of mathematical relations among the features and labels, the prior knowledge, into a distribution governing the probability mass across the uncertainty class. In this paper, we consider prior knowledge in the form of stochastic differential equations (SDEs). We consider a vector SDE in integral form involving a drift vector and dispersion matrix. Having constructed the prior, we develop the optimal Bayesian classifier between two models and examine, via synthetic experiments, the effects of uncertainty in the drift vector and dispersion matrix. We apply the theory to a set of SDEs for the purpose of differentiating the evolutionary history between two species.
Electronic supplementary material
The online version of this article (doi:10.1186/s13637-016-0036-y) contains supplementary material, which is available to authorized users.
Classification; Gaussian processes; Stochastic differential equations; Optimal Bayesian classifier
In this paper, we review multiscale modeling for cancer treatment with the incorporation of drug effects from an applied systems pharmacology perspective. Both classical pharmacology and systems biology are inherently quantitative; however, systems biology focuses more on networks and multifactorial controls over biological processes than on drugs and targets in isolation, whereas systems pharmacology has a strong focus on studying drugs with regard to the pharmacokinetic (PK) and pharmacodynamic (PD) relations accompanying drug interactions with multiscale physiology as well as the prediction of dosage-exposure responses and economic potentials of drugs. Systems pharmacology therefore requires multiscale methods to integrate models from the molecular level to the cellular, tissue, and organism levels. It is a common belief that tumorigenesis and tumor growth can be best understood and tackled by employing and integrating a multifaceted approach that includes in vivo and in vitro experiments, in silico models, multiscale tumor modeling, continuous/discrete modeling, agent-based modeling, and multiscale modeling with PK/PD drug effect inputs. We provide an example application of multiscale modeling employing a stochastic hybrid system for the colon cancer cell line HCT-116 treated with the drug Lapatinib. The simulation results are similar to those observed in wet-lab experiments at the Translational Genomics Research Institute.
multiscale model; cancer; drug effect modeling; pharmacodynamics; pharmacokinetics
Motivation: It is commonly assumed in pattern recognition that cross-validation error estimation is ‘almost unbiased’ as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.
Results: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an ‘almost unbiased’ theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
Availability and implementation: The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.
Supplementary data are available at Bioinformatics online.
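The bias described in the cross-validation abstract above can be illustrated with simple arithmetic. The following is a minimal sketch, not the authors' estimator: it only contrasts pooling held-out errors (which implicitly weights classes by the sampling ratio) with weighting class-specific error rates by a known class prior. The function names, the held-out outcomes, and the prior `c0 = 0.9` are all hypothetical values chosen for illustration.

```python
def pooled_error(errors0, errors1):
    """Fraction of held-out points misclassified, pooling both classes.

    This implicitly weights each class by its share of the sample.
    """
    return (sum(errors0) + sum(errors1)) / (len(errors0) + len(errors1))


def prior_weighted_error(errors0, errors1, c0):
    """Combine class-specific error rates using the class-0 prior c0."""
    e0 = sum(errors0) / len(errors0)
    e1 = sum(errors1) / len(errors1)
    return c0 * e0 + (1.0 - c0) * e1


# Hypothetical held-out outcomes (1 = misclassified), 50 points per class,
# as would arise when the two populations are sampled separately.
errors0 = [1] * 5 + [0] * 45     # class 0: 10% error
errors1 = [1] * 20 + [0] * 30    # class 1: 40% error
c0 = 0.9                         # assumed true prior probability of class 0

pooled = pooled_error(errors0, errors1)
weighted = prior_weighted_error(errors0, errors1, c0)
print(f"pooled: {pooled:.2f}, prior-weighted: {weighted:.2f}")
```

With a balanced sample but a 0.9 prior, the pooled figure (0.25) is far from the prior-weighted figure (0.13). The sketch covers only the weighting of errors; under separate sampling the designed classifier itself may also depend on the sampling ratio, which this example does not address.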
An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first.
The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks.
Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method performs close to the optimal method but at lower computational cost. The proposed approximate method also significantly outperforms the random selection policy. MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/.
Experimental design; gene regulatory networks; mean objective cost of uncertainty; objective-based network reduction; Boolean networks; structural intervention
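The MOCU-based selection rule described in the abstract above can be sketched on a toy uncertainty class. This is an illustrative sketch of the exhaustive (optimal) rule only, not the network reduction scheme the paper proposes. It assumes a finite set of candidate models indexed by tuples of unknown parameter values, a table of intervention costs per model, and experiments that each reveal one parameter exactly; the names `mocu`, `expected_remaining_mocu`, and all numbers are hypothetical.

```python
from collections import defaultdict


def mocu(prior, cost):
    """Expected cost increase from using the robust intervention.

    prior: {model: probability}; cost: {model: [cost of each intervention]}.
    """
    n_actions = len(next(iter(cost.values())))
    expected = [sum(prior[m] * cost[m][a] for m in prior)
                for a in range(n_actions)]
    # Robust intervention: minimizes the expected cost over the class.
    robust = min(range(n_actions), key=expected.__getitem__)
    # Compare it with the model-specific optimal intervention.
    return sum(prior[m] * (cost[m][robust] - min(cost[m])) for m in prior)


def expected_remaining_mocu(prior, cost, param_index):
    """Average the posterior MOCU over the outcomes of one experiment."""
    groups = defaultdict(dict)
    for model, p in prior.items():
        groups[model[param_index]][model] = p
    total = 0.0
    for members in groups.values():
        p_outcome = sum(members.values())
        posterior = {m: p / p_outcome for m, p in members.items()}
        total += p_outcome * mocu(posterior, cost)
    return total


# Toy class: two unknown binary parameters, uniform prior, two interventions.
# Only the first parameter affects which intervention is best.
prior = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
cost = {m: ([0.2, 0.8] if m[0] == 0 else [0.8, 0.2]) for m in prior}

current = mocu(prior, cost)
remaining = [expected_remaining_mocu(prior, cost, i) for i in range(2)]
best_experiment = min(range(2), key=remaining.__getitem__)
print(current, remaining, best_experiment)
```

In this toy case, measuring the first parameter removes all objective-relevant uncertainty, while measuring the second leaves the MOCU unchanged, so the first experiment is ranked highest. The cost of this exhaustive rule grows with the number of compatible models, which is the expense the abstract refers to.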
Most dynamical models for genomic networks are built upon two current methodologies, one process-based and the other based on Boolean-type networks. Both are problematic when it comes to experimental design purposes in the laboratory. The first approach requires a comprehensive knowledge of the parameters involved in all biological processes a priori, whereas the results from the second method may not have a biological correspondence and thus cannot be tested in the laboratory. Moreover, the current methods cannot readily utilize existing curated knowledge databases and do not consider uncertainty in the knowledge. Therefore, a new methodology is needed that can generate a dynamical model based on available biological data, assuming uncertainty, while the results from experimental design can be examined in the laboratory.
We propose a new methodology for dynamical modeling of genomic networks that can utilize the interaction knowledge provided in public databases. The model assigns discrete states for physical entities, sets priorities among interactions based on information provided in the database, and updates each interaction based on associated node states. Whenever uncertainty in dynamics arises, it explores all possible outcomes. By using the proposed model, biologists can study regulation networks that are too complex for manual analysis.
The proposed approach can be effectively used for constructing dynamical models of interaction-based genomic networks without requiring a complete knowledge of all parameters affecting the network dynamics, and thus based on a small set of available data.
Dynamical model; Uncertain networks; Algorithm design
Contemporary high-throughput technologies provide measurements of very large numbers of variables but often with very small sample sizes. This paper proposes an optimization-based paradigm for utilizing prior knowledge to design better performing classifiers when sample sizes are limited. We derive approximate expressions for the first and second moments of the true error rate of the proposed classifier under the assumption of two widely used models for the uncertainty classes: ε-contamination and p-point classes. The applicability of the approximate expressions is discussed by defining the problem of finding optimal regularization parameters through minimizing the expected true error. Simulation results using the Zipf model show that the proposed paradigm yields improved classifiers that outperform traditional classifiers that use only training data. Our application of interest involves discrete gene regulatory networks possessing labeled steady-state distributions. Given prior operational knowledge of the process, our goal is to build a classifier that can accurately label future observations obtained in the steady state by utilizing both the available prior knowledge and the training data. We examine the proposed paradigm on networks containing NF-κB pathways, where it shows significant improvement in classifier performance over the classical data-only approach to classifier design. Companion website: http://gsp.tamu.edu/Publications/supplementary/shahrokh12a.
Steady-state classifier; biological-pathway knowledge; uncertainty class; regularized
The mechanisms by which n-3 polyunsaturated fatty acids (PUFAs) decrease colon tumor formation have not been fully elucidated. Examination of genes up- or down-regulated at various stages of tumor development via the monitoring of gene expression relationships will help to determine the biological processes ultimately responsible for the protective effects of n-3 PUFA. Therefore, using a 3 × 2 × 2 factorial design, we used Codelink DNA microarrays containing ∼9000 genes to help decipher the global changes in colonocyte gene expression profiles in carcinogen-injected Sprague Dawley rats. Animals were assigned to three dietary treatments differing only in the type of fat (corn oil/n-6 PUFA, fish oil/n-3 PUFA, or olive oil/n-9 monounsaturated fatty acid), two treatments (injection with the carcinogen azoxymethane or with saline), and two time points (12 hours and 10 weeks after first injection). Only the consumption of n-3 PUFA exerted a protective effect at the initiation (DNA adduct formation) and promotional (aberrant crypt foci) stages. Importantly, microarray analysis of colonocyte gene expression profiles discerned fundamental differences among animals treated with n-3 PUFA at both the 12-hour and 10-week time points. Thus, in addition to demonstrating that dietary fat composition alters the molecular portrait of gene expression profiles in the colonic epithelium at both the initiation and promotional stages of tumor development, these findings indicate that the chemopreventive effect of fish oil is due to the direct action of n-3 PUFA and not to a reduction in the content of n-6 PUFA.
There are two distinct issues regarding network validation: (1) Does an inferred network provide good predictions relative to experimental data? (2) Does a network inference algorithm applied within a certain network model framework yield networks that are accurate relative to some criterion of goodness? The first issue concerns scientific validation and the second concerns algorithm validation. In this paper we consider inferential validation relative to controllability; that is, if an inference procedure is applied to data generated from a gene regulatory network and an intervention procedure is designed on the inferred network, how well does it perform on the true network? The reasoning behind such a criterion is that, if our purpose is to use gene regulatory networks to design therapeutic intervention strategies, then we are not concerned with network fidelity, per se, but only with our ability to design effective interventions based on the inferred network. We consider the problem from the perspective of stationary control, which involves designing a control policy to be applied over time based on the current state of the network, with the decision procedure itself being time independent. The objective of a control policy is to optimally reduce the total steady-state probability mass of the undesirable states (phenotypes), which is equivalent to optimally increasing the total steady-state mass of the desirable states. Based on this criterion we compare several proposed network inference procedures. We will see that inference procedure ψ may perform worse than inference procedure ξ relative to inferring the full network structure but perform better than ξ relative to controllability. Hence, when one is aiming at a specific application, it may be wise to use an objective-based measure of inference validity.
network inference; genetic regulatory network; control; validation; probabilistic Boolean network
Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions.
Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading class performance over an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA).
Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code are available under an open source license at http://bit.ly/1gimnss.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0401-3) contains supplementary material, which is available to authorized users.
Classification; RNA-Seq; Model-based; Bayesian
Restricted Boolean networks are simplified Boolean networks in which every regulation between genes is required to be either negative or positive. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback, namely, it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm based on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance μhame, the normalized Hamming distance of state transition μhamst, and the steady-state distribution distance μssd. Results show that the proposed algorithm outperforms the others according to both μhame and μhamst, whereas its performance according to μssd is intermediate between the best-fit and three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.
Restricted Boolean network; Inference; Budding yeast cell cycle
Perfect knowledge of the underlying state transition probabilities is necessary for designing an optimal intervention strategy for a given Markovian genetic regulatory network. However, in many practical situations, the complex nature of the network and/or identification costs limit the availability of such perfect knowledge. To address this difficulty, we propose to take a Bayesian approach and represent the system of interest as an uncertainty class of several models, each assigned some probability, which reflects our prior knowledge about the system. We define the objective function to be the expected cost relative to the probability distribution over the uncertainty class and formulate an optimal Bayesian robust intervention policy minimizing this cost function. The resulting policy may not be optimal for a fixed element within the uncertainty class, but it is optimal when averaged across the uncertainty class. Furthermore, starting from a prior probability distribution over the uncertainty class and collecting samples from the process over time, one can update the prior distribution to a posterior and find the corresponding optimal Bayesian robust policy relative to the posterior distribution. Therefore, the optimal intervention policy is essentially nonstationary and adaptive.
Optimal intervention; Markovian gene regulatory networks; Probabilistic Boolean networks; Uncertainty; Prior knowledge; Bayesian control
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance.
gene set enrichment analysis; feature ranking; data model; simulation study
Boolean networks and, more generally, probabilistic Boolean networks, as one class of gene regulatory networks, model biological processes with the network dynamics determined by the logic-rule regulatory functions in conjunction with probabilistic parameters involved in network transitions. While there has been significant research on applying different control policies to alter network dynamics as future gene therapeutic intervention, we have seen less work on understanding the sensitivity of network dynamics with respect to perturbations to networks, including regulatory rules and the involved parameters, which is particularly critical for the design of intervention strategies. This paper studies this less investigated issue of network sensitivity in the long run. As the underlying model of probabilistic Boolean networks is a finite Markov chain, we define the network sensitivity based on the steady-state distributions of probabilistic Boolean networks and call it long-run sensitivity. The steady-state distribution reflects the long-run behavior of the network and it can give insight into the dynamics or momentum existing in a system. The change of steady-state distribution caused by possible perturbations is the key measure for intervention. This newly defined long-run sensitivity can provide insight on both network inference and intervention. We show the results for probabilistic Boolean networks generated from random Boolean networks and the results from two real biological networks illustrate preliminary applications of sensitivity in intervention for practical problems.
Genetic regulatory networks; Boolean networks; probabilistic Boolean networks; Markov chains; sensitivity; steady-state distribution; intervention; metastasis
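The idea of measuring sensitivity through the change in the steady-state distribution, as described in the abstract above, can be sketched for a small Markov chain. This is a minimal illustration under stated assumptions: the chain is ergodic (so power iteration converges), and the shift is summarized by the L1 distance between the two steady-state distributions, a choice made here for illustration since the abstract does not specify the norm. The function names and the two-state transition matrices are hypothetical.

```python
def steady_state(P, tol=1e-12, max_iter=100_000):
    """Steady-state distribution of an ergodic chain by power iteration.

    P is a row-stochastic transition matrix given as a list of rows.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


def long_run_shift(P, P_perturbed):
    """L1 distance between steady-state distributions before and after."""
    pi = steady_state(P)
    pi_perturbed = steady_state(P_perturbed)
    return sum(abs(a - b) for a, b in zip(pi, pi_perturbed))


# Hypothetical two-state chain and a perturbation of its first row.
P = [[0.9, 0.1],
     [0.5, 0.5]]
P_perturbed = [[0.8, 0.2],
               [0.5, 0.5]]

shift = long_run_shift(P, P_perturbed)
print(steady_state(P), steady_state(P_perturbed), shift)
```

For a two-state chain the steady state has a closed form, so the result can be checked by hand: the distributions are (5/6, 1/6) and (5/7, 2/7), giving a shift of 5/21. For networks with many genes the state space grows exponentially, and power iteration on the full matrix becomes impractical.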
Hannah Arendt, one of the foremost political philosophers of the twentieth century, has argued that it is the responsibility of educators not to leave children in their own world but instead to bring them into the adult world so that, as adults, they can carry civilization forward to whatever challenges it will face by bringing to bear the learning of the past. In the same collection of essays, she discusses the recognition by modern science that Nature is inconceivable in terms of ordinary human conceptual categories - as she writes, ‘unthinkable in terms of pure reason’. Together, these views on scientific education lead to an educational process that transforms children into adults, with a scientific adult being one who has the ability to conceptualize scientific systems independent of ordinary physical intuition. This article begins with Arendt’s basic educational and scientific points and develops from them a critique of current scientific education in conjunction with an appeal to educate young scientists in a manner that allows them to fulfill their potential ‘on the shoulders of giants’. While the article takes a general philosophical perspective, its specifics tend to be directed at biomedical education, in particular, how such education pertains to translational science.
Motivation: A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure, for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in substantially allocating more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study?
Results: This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed.
Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/. Supplementary simulation results are also included.
Supplementary data are available at Bioinformatics online.
This paper provides exact analytical expressions for the first and second moments of the true error for linear discriminant analysis (LDA) when the data are univariate and taken from two stochastic Gaussian processes. The key point is that we assume a general setting in which the sample data from each class do not need to be identically distributed or independent within or between classes. We compare the true errors of designed classifiers under the typical i.i.d. model and when the data are correlated, providing exact expressions and demonstrating that, depending on the covariance structure, correlated data can result in classifiers with either greater error or less error than when training with uncorrelated data. The general theory is applied to autoregressive and moving-average models of the first order, and it is demonstrated using real genomic data.
Linear discriminant analysis; Stochastic settings; Correlated data; Non-i.i.d data; Expected error; Gaussian processes; Auto-regressive models; Moving-average models
A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than would be the actual expression levels themselves, were they directly observable.
Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered and NGS-based classification is compared to SAGE-based classification and classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification.
Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.
A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from these data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims.
Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
Scientific knowledge is grounded in a particular epistemology and, owing to the requirements of that epistemology, possesses limitations. Some limitations are intrinsic, in the sense that they depend inherently on the nature of scientific knowledge; others are contingent, depending on the present state of knowledge, including technology. Understanding limitations facilitates scientific research because one can then recognize when one is confronted by a limitation, as opposed to simply being unable to solve a problem within the existing bounds of possibility. In the hope that the role of limiting factors can be brought more clearly into focus and discussed, we consider several sources of limitation as they apply to biological knowledge: mathematical complexity, experimental constraints, validation, knowledge discovery, and human intellectual capacity.
Complexity; Gene regulatory networks; Epistemology; Experimental design; Genomics; Knowledge discovery; Modeling; Validation.
Time-course expression profiles and methods for spectrum analysis have been applied for detecting transcriptional periodicities, which are valuable patterns for identifying genes associated with cell cycle and circadian rhythm regulation. However, most of the proposed methods suffer, to some extent, from restrictions and large numbers of false positives. Additionally, in some experiments, arbitrarily irregular sampling times as well as the presence of high noise and small sample sizes make accurate detection a challenging task. A novel scheme for detecting periodicities in time-course expression data is proposed, in which a real-valued iterative adaptive approach (RIAA), originally proposed for signal processing, is applied for periodogram estimation. The inferred spectrum is then analyzed using Fisher's hypothesis test. With a proper p-value threshold, periodic genes can be detected. A periodic signal, two nonperiodic signals, and four sampling strategies were considered in the simulations, including both bursts and drops. In addition, two real yeast datasets were used for validation. The simulations and real data analysis reveal that RIAA can perform competitively with the existing algorithms. The advantage of RIAA is manifested when the expression data are highly irregularly sampled, and when the number of cycles covered by the sampling time points is very small.
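The second step of the scheme above, testing a spectrum with Fisher's test, can be sketched as follows. This sketch does not implement RIAA: as a stand-in it uses the classical periodogram of an evenly sampled series at the Fourier frequencies, which is the setting in which Fisher's exact p-value formula holds under Gaussian white noise. Applying the same formula to a spectrum estimated from irregularly sampled data is at best an approximation. The function names and the test signal are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Classical periodogram at Fourier frequencies k = 1..floor((n-1)/2)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    ordinates = []
    for k in range(1, (n - 1) // 2 + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centered))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def fisher_g_test(ordinates):
    """Fisher's g statistic and its exact p-value under white noise."""
    m = len(ordinates)
    g = max(ordinates) / sum(ordinates)
    # P(G > g) = sum_{j=1}^{floor(1/g)} (-1)^(j-1) C(m, j) (1 - j g)^(m-1)
    p = sum((-1) ** (j - 1) * math.comb(m, j) * (1 - j * g) ** (m - 1)
            for j in range(1, int(1 / g) + 1))
    return g, min(max(p, 0.0), 1.0)


# Hypothetical noise-free signal: 3 full cycles over 24 evenly spaced samples.
signal = [math.sin(2 * math.pi * 3 * t / 24) for t in range(24)]
spectrum = periodogram(signal)
g, p_value = fisher_g_test(spectrum)
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
print(g, p_value, peak_index + 1)
```

For this signal nearly all spectral power falls at the third Fourier frequency, so g is close to 1 and the p-value is close to 0. At the other extreme, a perfectly flat spectrum gives g = 1/m and a p-value of 1. A gene would be called periodic when its p-value falls below the chosen threshold, typically after a multiple-testing correction.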