PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (1441262)

Clipboard (0)
None

Related Articles

1.  Circadian signatures in rat liver: from gene expression to pathways 
BMC Bioinformatics  2010;11:540.
Background
Circadian rhythms are 24 hour oscillations in many behavioural, physiological, cellular and molecular processes that are controlled by an endogenous clock which is entrained to environmental factors including light, food and stress. Transcriptional analyses of circadian patterns demonstrate that genes showing circadian rhythms are part of a wide variety of biological pathways.
Pathway activity method can identify the significant pattern of the gene expression levels within a pathway. In this method, the overall gene expression levels are translated to a reduced form, pathway activity levels, via singular value decomposition (SVD). A given pathway represented by pathway activity levels can then be as analyzed using the same approaches used for analyzing gene expression levels. We propose to use pathway activity method across time to identify underlying circadian pattern of pathways.
Results
We used synthetic data to demonstrate that pathway activity analysis can evaluate the underlying circadian pattern within a pathway even when circadian patterns cannot be captured by the individual gene expression levels. In addition, we illustrated that pathway activity formulation should be coupled with a significance analysis to distinguish biologically significant information from random deviations. Next, we performed pathway activity level analysis on a rich time series of transcriptional profiling in rat liver. The over-represented five specific patterns of pathway activity levels, which cannot be explained by random event, exhibited circadian rhythms. The identification of the circadian signatures at the pathway level identified 78 pathways related to energy metabolism, amino acid metabolism, lipid metabolism and DNA replication and protein synthesis, which are biologically relevant in rat liver. Further, we observed tight coordination between cholesterol biosynthesis and bile acid biosynthesis as well as between folate biosynthesis, one carbon pool by folate and purine-pyrimidine metabolism. These coupled pathways are parts of a sequential reaction series where the product of one pathway is the substrate of another pathway.
Conclusions
Rather than assessing the importance of a single gene beforehand and map these genes onto pathways, we instead examined the orchestrated change within a pathway. Pathway activity level analysis could reveal the underlying circadian dynamics in the microarray data with an unsupervised approach and biologically relevant results were obtained.
doi:10.1186/1471-2105-11-540
PMCID: PMC2990769  PMID: 21040584
2.  Extension of a genetic network model by iterative experimentation and mathematical analysis 
Molecular Systems Biology  2005;1:2005.0013.
We extend the current model of the plant circadian clock, in order to accommodate new and published data. Throughout our model development we use a global parameter search to ensure that any limitations we find are due to the network architecture and not to our selection of the parameter values, which have not been determined experimentally. Our final model includes two, interlocked loops of gene regulation and is reminiscent of the circuit structures previously identified by experiments on insect and fungal clocks. It is the first Arabidopsis clock model to show such good correspondence to experimental data.Our interlocked feedback loop model predicts the regulation of two unknown components. Experiments motivated by these predictions identify the GIGANTEA gene as a strong candidate for one component, with an unexpected pattern of light regulation.*
This study involves an iterative approach of mathematical modelling and experiment to develop an accurate mathematical model of the circadian clock in the higher plant Arabidopsis thaliana. Our approach is central to systems biology and should lead to a greater, quantitative understanding of the circadian clock, as well as being more widely relevant to research into genetic networks.
The day–night cycle caused by the Earth's rotation affects most organisms, and has resulted in the evolution of the circadian clock. The circadian clock controls 24-h rhythms in processes from metabolism to behaviour; in higher eukaryotes, the circadian clock controls the rhythmic expression of 5–10% of genes. In plants, the clock controls leaf and petal movements, the opening and closing of stomatal pores, the discharge of floral fragrances and many metabolic activities, especially those associated with photosynthesis.
The relatively small number of components involved in the central circadian network makes it an ideal candidate for mathematical modelling of complex biological regulation. Genetic studies in a variety of model organisms have shown that the circadian rhythm is generated by a central network of between 6 and 12 genes. These genes form feedback loops generating a rhythm in mRNA production. One negative feedback loop in which a gene encodes a protein that, after several hours, turns off transcription is, in principle, capable of creating a circadian rhythm. However, real circadian clocks have proven to be more complicated than this, with interlocked feedback loops. Networks of this complexity are more easily understood through mathematical modelling.
The clock mechanism in the model plant, A. thaliana, was first proposed to comprise a feedback loop in which two partially redundant genes, LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), repress the expression of their activator, TIMING OF CAB EXPRESSION 1 (TOC1). We previously modelled this preliminary network and showed that it was not capable of recreating several important pieces of experimental data (Locke et al, 2005). Here, we extend the LHY/CCA1–TOC1 network in new mathematical models. To check the effects of each addition to the network, the outputs of the extended models are compared to published data and to new experiments.
As is the case for most biological networks, the parameter values in our model, such as the translation rate of TOC1 protein, are unknown. We employ here an optimisation method, which works well with noisy and varied data and allows a global search of parameter space. This should ensure that the limitations we find in our networks are due to the network structure, and not to our parameter choices.
Our final interlocked feedback loop model requires two hypothetical components, genes X and Y (Figure 4), but is the first Arabidopsis clock model to exhibit such a good correspondence with experimental data. The model simulates a residual short-period oscillation in the cca1;lhy mutant, as characterised by our experiments. No single-loop model is able to do this. Our model also matches experimental data under constant light (LL) conditions and correctly senses photoperiod. The model predicts an interlocked feedback loop structure similar to that seen in the circadian clock mechanisms of other organisms.
The interlocked feedback loop model predicts a distinctive pattern of Y mRNA accumulation in the wild type (WT) and in the cca1;lhy double mutant, with Y mRNA levels increasing transiently at dawn. We designed an experiment to identify Y based on this prediction. GIGANTEA (GI) mRNA levels fit very well to our predicted profile for Y (Figure 6), identifying GI as a strong candidate for Y.
The approach described here could act as a template for experimental biologists seeking to extend models of small genetic networks. Our results illustrate the usefulness of mathematical modelling in guiding experiments, even if the models are based on limited data. Our method provides a way of identifying suitable candidate networks and quantifying how these networks better describe a wide variety of experimental measurements. The characteristics of new putative genes are thereby obtained, facilitating the experimental search for new components. To facilitate future experimental design, we provide user-friendly software that is specifically designed for numerical simulation of circadian experiments using models for several species (Brown, 2004b).
*Footnote: Synopsis highlights were added on 5 July 2005.
Circadian clocks involve feedback loops that generate rhythmic expression of key genes. Molecular genetic studies in the higher plant Arabidopsis thaliana have revealed a complex clock network. The first part of the network to be identified, a transcriptional feedback loop comprising TIMING OF CAB EXPRESSION 1 (TOC1), LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), fails to account for significant experimental data. We develop an extended model that is based upon a wider range of data and accurately predicts additional experimental results. The model comprises interlocking feedback loops comparable to those identified experimentally in other circadian systems. We propose that each loop receives input signals from light, and that each loop includes a hypothetical component that had not been explicitly identified. Analysis of the model predicted the properties of these components, including an acute light induction at dawn that is rapidly repressed by LHY and CCA1. We found this unexpected regulation in RNA levels of the evening-expressed gene GIGANTEA (GI), supporting our proposed network and making GI a strong candidate for this component.
doi:10.1038/msb4100018
PMCID: PMC1681447  PMID: 16729048
biological rhythms; gene network; mathematical modelling; parameter estimation
3.  Significance analysis of microarray transcript levels in time series experiments 
BMC Bioinformatics  2007;8(Suppl 1):S10.
Background
Microarray time series studies are essential to understand the dynamics of molecular events. In order to limit the analysis to those genes that change expression over time, a first necessary step is to select differentially expressed transcripts. A variety of methods have been proposed to this purpose; however, these methods are seldom applicable in practice since they require a large number of replicates, often available only for a limited number of samples. In this data-poor context, we evaluate the performance of three selection methods, using synthetic data, over a range of experimental conditions. Application to real data is also discussed.
Results
Three methods are considered, to assess differentially expressed genes in data-poor conditions. Method 1 uses a threshold on individual samples based on a model of the experimental error. Method 2 calculates the area of the region bounded by the time series expression profiles, and considers the gene differentially expressed if the area exceeds a threshold based on a model of the experimental error. These two methods are compared to Method 3, recently proposed in the literature, which exploits splines fit to compare time series profiles. Application of the three methods to synthetic data indicates that Method 2 outperforms the other two both in Precision and Recall when short time series are analyzed, while Method 3 outperforms the other two for long time series.
Conclusion
These results help to address the choice of the algorithm to be used in data-poor time series expression study, depending on the length of the time series.
doi:10.1186/1471-2105-8-S1-S10
PMCID: PMC1885839  PMID: 17430554
4.  A novel computational model of the circadian clock in Arabidopsis that incorporates PRR7 and PRR9 
We developed a mathematical model of the Arabidopsis circadian clock, including PRR7 and PRR9, which is able to predict several single, double and triple mutant phenotypes.Sensitivity Analysis was used to identify the properties and time sensing mechanisms of model structures.PRR7 and CCA1/LHY were identified as weak points of the mathematical model indicating where more experimental data is needed for further model development.Detailed dynamical studies showed that the timing of an evening light sensing element is essential for day length responsiveness
In recent years, molecular genetic techniques have revealed a complex network of components in the Arabidopsis circadian clock. Mathematical models allow for a detailed study of the dynamics and architecture of such complex gene networks leading to a better understanding of the genetic interactions. It is important to maintain a constant iteration with experimentation, to include novel components as they are discovered and use the updated model to design new experiments. This study develops a framework to introduce new components into the mathematical model of the Arabidopsis circadian clock accelerating the iterative model development process and gaining insight into the system's properties.
We used the interlocked feedback loop model published in Locke et al (2005) as the base model. In Arabidopsis, the first suggested regulatory loop involves the morning expressed transcription factors CIRCADIAN CLOCK-ASSOCIATED 1 (CCA1) and LATE ELONGATED HYPOCOTYL (LHY), and the evening expressed pseudo-response regulator TIMING OF CAB EXPRESSION (TOC1). The hypothetical component X had been introduced to realize a longer delay between gene expression of CCA1/LHY and TOC1. The introduction of Y was motivated by the need for a mechanism to reproduce the dampening short period rhythms of the cca1/lhy double mutant and to include an additional light input at the end of the day.
In this study, the new components pseudo-response regulators PRR7 and PRR9 were added in negative feedback loops based on the biological hypothesis that they are activated by LHY and in turn repress LHY transcription (Farré et al, 2005; Figure 1). We present three iterations steps of model development (Figure 1A–C).
A wide range of tools was used to establish and analyze new model structures. One of the challenges facing mathematical modeling of biological processes is parameter identification; they are notoriously difficult to determine experimentally. We established an optimization procedure based on an evolutionary strategy with a cost function mainly derived from wild-type characteristics. This ensured that the model was not restricted by a specific set of parameters and enabled us to use a large set of biological mutant information to assess the predictive capability of the model structure. Models were evaluated by means of an extended phenotype catalogue, allowing for an easy and fair comparison of the structures. We also carried out detailed simulation analysis of component interactions to identify weak points in the structure and suggest further modifications. Finally, we applied sensitivity analysis in a novel manner, using it to direct the model development. Sensitivity analysis provides quantitative measures of robustness; the two measures in this study were the traces of component concentrations over time (classical state sensitivities) and phase behavior (measured by the phase response curve). Three major results emerged from the model development process.
First, the iteration process helped us to learn about general characteristics of the system. We observed that the timing of Y expression is critical for evening light entrainment, which enables the system to respond to changes in day length. This is important for our understanding of the mechanism of light input to the clock and will add in the identification of biological candidates for this function. In addition, our results suggest that a detailed description of the mechanisms of genetic interactions is important for the systems behavior. We observed that the introduction of an experimentally based precise light regulation mechanism on PRR9 expression had a significant effect on the systems behavior.
Second, the final model structure (Figure 1C) was capable of predicting a wide range of mutant phenotypes, such as a reduction of TOC1 expression by RNAi (toc1RNAi), mutations in PRR7 and PRR9 and the novel mutant combinations prr9toc1RNAi and prr7prr9toc1RNAi. However, it was unable to predict the mutations in CCA1 and LHY.
Finally, sensitivity analysis identified the weak points of the system. The developed model structure was heavily based on the TOC1/Y feedback loop. This could explain the model's failure to predict the cca1lhy double mutant phenotype. More detailed information on the regulation of CCA1 and LHY expression will be important to achieve the right balance between the different regulatory loops in the mathematical model. This is in accordance with genetic studies that have identified several genes involved in the regulation of LHY and CCA1 expression. The identification of their mechanism of action will be necessary for the next model development.
In plants, as in animals, the core mechanism to retain rhythmic gene expression relies on the interaction of multiple feedback loops. In recent years, molecular genetic techniques have revealed a complex network of clock components in Arabidopsis. To gain insight into the dynamics of these interactions, new components need to be integrated into the mathematical model of the plant clock. Our approach accelerates the iterative process of model identification, to incorporate new components, and to systematically test different proposed structural hypotheses. Recent studies indicate that the pseudo-response regulators PRR7 and PRR9 play a key role in the core clock of Arabidopsis. We incorporate PRR7 and PRR9 into an existing model involving the transcription factors TIMING OF CAB (TOC1), LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED (CCA1). We propose candidate models based on experimental hypotheses and identify the computational models with the application of an optimization routine. Validation is accomplished through systematic analysis of various mutant phenotypes. We introduce and apply sensitivity analysis as a novel tool for analyzing and distinguishing the characteristics of proposed architectures, which also allows for further validation of the hypothesized structures.
doi:10.1038/msb4100101
PMCID: PMC1682023  PMID: 17102803
Arabidopsis; circadian rhythms; mathematical modeling; parameter optimization; sensitivity analysis
5.  An approach for clustering gene expression data with error information 
BMC Bioinformatics  2006;7:17.
Background
Clustering of gene expression patterns is a well-studied technique for elucidating trends across large numbers of transcripts and for identifying likely co-regulated genes. Even the best clustering methods, however, are unlikely to provide meaningful results if too much of the data is unreliable. With the maturation of microarray technology, a wealth of research on statistical analysis of gene expression data has encouraged researchers to consider error and uncertainty in their microarray experiments, so that experiments are being performed increasingly with repeat spots per gene per chip and with repeat experiments. One of the challenges is to incorporate the measurement error information into downstream analyses of gene expression data, such as traditional clustering techniques.
Results
In this study, a clustering approach is presented which incorporates both gene expression values and error information about the expression measurements. Using repeat expression measurements, the error of each gene expression measurement in each experiment condition is estimated, and this measurement error information is incorporated directly into the clustering algorithm. The algorithm, CORE (Clustering Of Repeat Expression data), is presented and its performance is validated using statistical measures. By using error information about gene expression measurements, the clustering approach is less sensitive to noise in the underlying data and it is able to achieve more accurate clusterings. Results are described for both synthetic expression data as well as real gene expression data from Escherichia coli and Saccharomyces cerevisiae.
Conclusion
The additional information provided by replicate gene expression measurements is a valuable asset in effective clustering. Gene expression profiles with high errors, as determined from repeat measurements, may be unreliable and may associate with different clusters, whereas gene expression profiles with low errors can be clustered with higher specificity. Results indicate that including error information from repeat gene expression measurements can lead to significant improvements in clustering accuracy.
doi:10.1186/1471-2105-7-17
PMCID: PMC1360687  PMID: 16409635
6.  Quantitative analysis of regulatory flexibility under changing environmental conditions 
Day length changes with the seasons in temperate latitudes, affecting the many biological rhythms that entrain to the day/night cycle: we measure these effects on the expression of Arabidopsis clock genes, using RNA and reporter gene readouts, with a new method of phase analysis.Dusk sensitivity is proposed as a simple, natural and general mathematical measure to analyse and manipulate the changing phase of a clock output relative to the change in the day/night cycle.Dusk sensitivity shows how increasing the numbers of feedback loops in the Arabidopsis clock models allows more flexible regulation, consistent with a previously-proposed, general operating principle of biological networks.The Arabidopsis clock genes show flexibility of regulation that is characteristic of a three-loop clock model, validating aspects of the model and the operating principle, but some clock output genes show greater flexibility arising from direct light regulation.
The analysis of dynamic, non-linear regulation with the aid of mechanistic models is central to Systems Biology. This study compares the predictions of mechanistic, mathematical models of the circadian clock with molecular time-series data on rhythmic gene expression in the higher plant Arabidopsis thaliana. Analysis of the models helps us to understand (explain and predict) how the clock gene circuit balances regulation by external and endogenous factors to achieve particular behaviours. Such multi-factorial regulation is ubiquitous in, and characteristic of, living systems.
The Earth's rotation causes predictable changes in the environment, notably in the availability of sunlight for photosynthesis. Many biological processes are driven by the environmental input via sensory pathways, for example, from photoreceptors. Circadian clocks provide an alternative strategy. These endogenous, 24-h rhythms can drive biological processes that anticipate the regular environmental changes, rather than merely responding. Many rhythmic processes have both light and clock control. Indeed, the clock components themselves must balance internal timing with external inputs, because circadian clocks are reset daily through light regulation of one or more clock components. This process of entrainment is complicated by the change in day length. When the times of dawn and dusk move apart in summer, and closer together in winter, does the clock track dawn, track dusk or interpolate between them?
In plants, the clock controls leaf and petal movements, the opening and closing of stomatal pores, the discharge of floral fragrances, and many metabolic activities, especially those associated with photosynthesis. Centuries of physiological studies have shown that these rhythms can behave differently. Flowering in Ipomoea nil (Pharbitis nil, Japanese morning glory) is controlled by a rhythm that tracks the time of dusk, to give a classic example. We showed that two other rhythms associated with vegetative growth track dawn in this species (Figure 5A), so the clock system allows flexible regulation.
The relatively small number of components involved in the circadian clockwork makes it an ideal candidate for mathematical modelling. Molecular genetic studies in a variety of model eukaryotes have shown that the circadian rhythm is generated by a network of 6–20 genes. These genes form feedback loops generating a rhythm in mRNA production. A single negative feedback loop in which a gene encodes a protein that, after several hours, turns off transcription is capable of generating a circadian rhythm, in principle. A single light input can entrain the clock to ‘local time', synchronised with a light–dark cycle. However, real circadian clocks have proven to be more complicated than this, with multiple light inputs and interlocked feedback loops.
We have previously argued from mathematical analysis that multi-loop networks increase the flexibility of regulation (Rand et al, 2004) and have shown that appropriately deployed flexibility can confer functional robustness (Akman et al, 2010). Here we test whether that flexibility can be demonstrated in vivo, in the model plant, A. thaliana. The Arabidopsis clock mechanism comprises a feedback loop in which two partially redundant, myb transcription factors, LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), repress the expression of their activator, TIMING OF CAB EXPRESSION 1 (TOC1). We previously modelled this single-loop circuit and showed that it was not capable of recreating important data (Locke et al, 2005a). An extended, two-loop model was developed to match observed behaviours, incorporating a hypothetical gene Y, for which the best identified candidate was the GIGANTEA gene (GI) (Locke et al, 2005b). Two further models incorporated the TOC1 homologues PSEUDO-RESPONSE REGULATOR (PRR) 9 and PRR7 (Locke et al, 2006; Zeilinger et al, 2006). In these circuits, a morning oscillator (LHY/CCA1–PRR9/7) is coupled to an evening oscillator (Y/GI–TOC1) via the original LHY/CCA1–TOC1 loop.
These clock models, like those for all other organisms, were developed using data from simple conditions of constant light, darkness or 12-h light–12-h dark cycles. We therefore tested how the clock genes in Arabidopsis responded to light–dark cycles with different photoperiods, from 3 h light to 18 h light per 24-h cycle (Edinburgh, 56° North latitude, has 17.5 h light in midsummer). The time-series assays of mRNA and in vivo reporter gene images showed a range of peak times for different genes, depending on the photoperiod (Figure 5C). A new data analysis method, mFourfit, was introduced to measure the peak times, in the Biological Rhythms Analysis Software Suite (BRASS v3.0). None of the genes showed the dusk-tracking behaviour characteristic of the Ipomoea flowering rhythm. The one-, two- and three-loop models were analysed to understand the observed patterns. A new mathematical measure, dusk sensitivity, was introduced to measure the change in timing of a model component versus a change in the time of dusk. The one- and two-loop models tracked dawn and dusk, respectively, under all conditions. Only the three-loop model (Figure 5B) had the flexibility required to match the photoperiod-dependent changes that we found in vivo, and in particular the unexpected, V-shaped pattern in the peak time of TOC1 expression. This pattern of regulation depends on the structure and light inputs to the model's evening oscillator, so the in vivo data supported this aspect of the model. LHY and CCA1 gene expression under short photoperiods showed greater dusk sensitivity, in the interval 2–6 h before dawn, than the three-loop model predicted, so these data will help to constrain future models.
The approach described here could act as a template for experimental biologists seeking to understand biological regulation using dynamic, experimental perturbations and time-series data. Simulation of mathematical models (despite known imperfections) can provide contrasting hypotheses that guide understanding. The system's detailed behaviour is complex, so a natural and general measure such as dusk sensitivity is helpful to focus on one property of the system. We used the measure to compare models, and to predict how this property could be manipulated. To enable additional analysis of this system, we provide the time-series data and experimental metadata online.
The circadian clock controls 24-h rhythms in many biological processes, allowing appropriate timing of biological rhythms relative to dawn and dusk. Known clock circuits include multiple, interlocked feedback loops. Theory suggested that multiple loops contribute the flexibility for molecular rhythms to track multiple phases of the external cycle. Clear dawn- and dusk-tracking rhythms illustrate the flexibility of timing in Ipomoea nil. Molecular clock components in Arabidopsis thaliana showed complex, photoperiod-dependent regulation, which was analysed by comparison with three contrasting models. A simple, quantitative measure, Dusk Sensitivity, was introduced to compare the behaviour of clock models with varying loop complexity. Evening-expressed clock genes showed photoperiod-dependent dusk sensitivity, as predicted by the three-loop model, whereas the one- and two-loop models tracked dawn and dusk, respectively. Output genes for starch degradation achieved dusk-tracking expression through light regulation, rather than a dusk-tracking rhythm. Model analysis predicted which biochemical processes could be manipulated to extend dusk tracking. Our results reveal how an operating principle of biological regulators applies specifically to the plant circadian clock.
doi:10.1038/msb.2010.81
PMCID: PMC3010117  PMID: 21045818
Arabidopsis thaliana; biological clocks; dynamical systems; gene regulatory networks; mathematical models; photoperiodism
7.  Mathematical Modelling of DNA Replication Reveals a Trade-off between Coherence of Origin Activation and Robustness against Rereplication 
PLoS Computational Biology  2010;6(5):e1000783.
Eukaryotic genomes are duplicated from multiple replication origins exactly once per cell cycle. In Saccharomyces cerevisiae, a complex molecular network has been identified that governs the assembly of the replication machinery. Here we develop a mathematical model that links the dynamics of this network to its performance in terms of rate and coherence of origin activation events, number of activated origins, the resulting distribution of replicon sizes and robustness against DNA rereplication. To parameterize the model, we use measured protein expression data and systematically generate kinetic parameter sets by optimizing the coherence of origin firing. While randomly parameterized networks yield unrealistically slow kinetics of replication initiation, networks with optimized parameters account for the experimentally observed distribution of origin firing times. Efficient inhibition of DNA rereplication emerges as a constraint that limits the rate at which replication can be initiated. In addition to the separation between origin licensing and firing, a time delay between the activation of S phase cyclin-dependent kinase (S-Cdk) and the initiation of DNA replication is required for preventing rereplication. Our analysis suggests that distributive multisite phosphorylation of the S-Cdk targets Sld2 and Sld3 can generate both a robust time delay and contribute to switch-like, coherent activation of replication origins. The proposed catalytic function of the complex formed by Dpb11, Sld3 and Sld2 strongly enhances coherence and robustness of origin firing. The model rationalizes how experimentally observed inefficient replication from fewer origins is caused by premature activation of S-Cdk, while premature activity of the S-Cdk targets Sld2 and Sld3 results in DNA rereplication. Thus the model demonstrates how kinetic deregulation of the molecular network governing DNA replication may result in genomic instability.
Author Summary
For a cell to divide into two daughter cells, its genetic information must be accurately duplicated. The large genomes in eukaryotic cells are copied from hundreds or thousands of replication origins to achieve the duplication of the entire DNA in a limited time span. Errors that result in incomplete or multiple copying of parts of the genome can cause cancer in humans. To avoid such errors, the replication origins must be activated coherently across the genome, and repeated firing of already activated origins must be strictly prevented. We developed a kinetic model of the biochemical network that governs the initiation of DNA replication in yeast to understand how these functional properties are realized through the interaction of multiple molecular players. Our computational analysis shows that optimized kinetic parameters are required for the biological functionality of the network, and such parameters indeed account for the measured kinetics of replication initiation. We predict that both the near-synchronous start of replication and the robustness against DNA rereplication are supported by time delays caused by multiple regulatory protein phosphorylations. Our analysis suggests that the kinetic design of the DNA replication network represents an adaptation to multiple, and partially conflicting, functional requirements.
doi:10.1371/journal.pcbi.1000783
PMCID: PMC2869307  PMID: 20485558
8.  Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments 
BMC Bioinformatics  2009;10:146.
Background
Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important in discovering functional related and co-regulated genes. Early developed clustering algorithms do not take advantage of the ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that can incorporate the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments that contain a large number of genes. Its computational burden also imposes difficulty to assess the clustering reliability, which is a very important measure when clustering noisy microarray data.
Results
We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also takes account of the ordering in time-course microarray experiments by embedding the order-restricted inference into a model selection framework. Genes are assigned to the profile which they best match determined by a newly proposed information criterion for order-restricted inference. In addition, we also developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method and saves hundreds of times computational time. Under some scenarios, its accuracy is also better than some other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3].
Conclusion
Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is meanwhile much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.
doi:10.1186/1471-2105-10-146
PMCID: PMC2696449  PMID: 19445669
9.  Comparative analysis of acute and chronic corticosteroid pharmacogenomic effects in rat liver: Transcriptional dynamics and regulatory structures 
BMC Bioinformatics  2010;11:515.
Background
Comprehensively understanding corticosteroid pharmacogenomic effects is an essential step towards an insight into the underlying molecular mechanisms for both beneficial and detrimental clinical effects. Nevertheless, even in a single tissue different methods of corticosteroid administration can induce different patterns of expression and regulatory control structures. Therefore, rich in vivo datasets of pharmacological time-series with two dosing regimens sampled from rat liver are examined for temporal patterns of changes in gene expression and their regulatory commonalities.
Results
The study addresses two issues, including (1) identifying significant transcriptional modules coupled with dynamic expression patterns and (2) predicting relevant common transcriptional controls to better understand the underlying mechanisms of corticosteroid adverse effects. Following the orientation of meta-analysis, an extended computational approach that explores the concept of agreement matrix from consensus clustering has been proposed with the aims of identifying gene clusters that share common expression patterns across multiple dosing regimens as well as handling challenges in the analysis of microarray data from heterogeneous sources, e.g. different platforms and time-grids in this study. Six significant transcriptional modules coupled with typical patterns of expression have been identified. Functional analysis reveals that virtually all enriched functions (gene ontologies, pathways) in these modules are shown to be related to metabolic processes, implying the importance of these modules in adverse effects under the administration of corticosteroids. Relevant putative transcriptional regulators (e.g. RXRF, FKHD, SP1F) are also predicted to provide another source of information towards better understanding the complexities of expression patterns and the underlying regulatory mechanisms of those modules.
Conclusions
We have proposed a framework to identify significant coexpressed clusters of genes across multiple conditions experimented from different microarray platforms, time-grids, and also tissues if applicable. Analysis on rich in vivo datasets of corticosteroid time-series yielded significant insights into the pharmacogenomic effects of corticosteroids, especially the relevance to metabolic side-effects. This has been illustrated through enriched metabolic functions in those transcriptional modules and the presence of GRE binding motifs in those enriched pathways, providing significant modules for further analysis on pharmacogenomic corticosteroid effects.
doi:10.1186/1471-2105-11-515
PMCID: PMC2973961  PMID: 20946642
10.  Data assimilation constrains new connections and components in a complex, eukaryotic circadian clock model 
Integrating molecular time-series data resulted in a more robust model of the plant clock, which predicts that a wave of inhibitory PRR proteins controls the morning genes LHY and CCA1.PRR5 is experimentally validated as a late-acting component of this wave.The family of sequentially expressed PRR proteins allows flexible entrainment of the clock, whereas a single protein could not, suggesting that the duplication of clock genes might confer this generic, functional advantage.The observed post-translational regulation of the evening protein TOC1 by interaction with ZTL and GI remains consistent with an indirect activation of TOC1 mRNA expression by GI, which was previously postulated from modelling.
Circadian rhythms are present in most eukaryotic organisms including plants. The core genes of the circadian clock are very important for plant physiology as they drive the rhythmic expression of around 30% of Arabidopsis genes (Edwards et al, 2006; Michael et al, 2008). The clock is normally entrained by daily environmental changes in light and temperature. Oscillations also persist under constant environmental conditions in a laboratory. The clock gene circuit in Arabidopsis is based on multiple interlocked feedback loops, which are typical of circadian genetic networks in other organisms (Dunlap and Loros, 2004; Bell-Pedersen et al, 2005). Mechanistic, mathematical models are increasingly useful in analysing and understanding how the observed molecular components give rise to the rhythmic behaviour of this dynamic, non-linear system.
Our previous model of Arabidopsis circadian clock (Locke et al, 2006) presented the core, three-loop structure of the clock, which comprised morning and evening oscillators and coupling between them (Figure 1). The morning loop included the dawn-expressed LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1) genes, which negatively regulate their expression through activation of the inhibitor proteins, PSEUDO-RESPONSE REGULATOR 9 (PRR9) and PRR7. These were described by a single, combined model component, PRR9/7. The evening loop included the dusk-expressed gene TIMING OF CAB EXPRESSION 1 (TOC1), which negatively regulates itself through inhibition of a hypothetical activator, gene Y. The evening-expressed gene GIGANTEA (GI) contributes to Y function. The morning and evening loops were connected through inhibition of the evening genes by LHY/CCA1 and activation of LHY/CCA1 expression by a hypothetical evening gene X. Here, we extend the previous model of circadian gene expression (Locke et al, 2006) based on recently published data (Figure 1). The new model retains the good match of our previous model to the large volume of molecular time-series data, and improves the behaviour of the model clock system under a range of light conditions and in a wider range of mutants.
The morning loop was extended by adding a hypothetical clock component, the night inhibitor (NI), which acts together with PRR9 and PRR7 to keep the expression of LHY and CCA1 at low levels over a broad interval spanning dusk. This regulation is important to set the phase of LHY/CCA1 expression at dawn. Data from the literature suggested that the PRR5 gene was a candidate for NI, leading us to predict that the sequentially expressed PRR9, PRR7 and PRR5 proteins together formed a wave of inhibitors of LHY and CCA1. This hypothesis was tested under discriminating light conditions, in which the light interval is replaced with the dawn and dusk pulses of light to form a ‘skeleton photoperiod'. Combining this protocol with mutation of the PRR7 and/or PRR5 genes, our new experimental results validated the model predictions and confirmed that PRR5 contributes to the function that we modelled as NI. During revision of this paper, that result received further experimental support (Nakamichi et al, 2010).
Model simulations revealed the functional importance of the inhibitor wave in entraining the clock to the light/dark cycle. Separating PRR9 from the other inhibitors in the model showed how the strong light activation observed for this gene contributes to more rapid entrainment. The observed, post-translation regulation of all three inhibitor proteins by light (Farre and Kay, 2007; Ito et al, 2007; Kiba et al, 2007) was also included in the model. Light-regulated degradation provides a molecular mechanism to explain the later phase of LHY and CCA1 expression under long photoperiods compared with short photoperiods, in line with experimental observations.
The connection between evening and morning loops was revised by including the inhibition of the morning gene PRR9 by the evening component TOC1, based on the data on TOC1-overexpressing plants (Makino et al, 2002; Ito et al, 2005). This inhibition causes a delay of PRR9 expression relative to LHY/CCA1, which allows LHY/CCA1 to reach a higher expression level at dawn. Our simulations showed that a partial mutant that lacks this inhibition of PRR9 by TOC1 is sufficient to cause the higher level of PRR9 and the short circadian period observed in toc1 mutant plants.
The evening loop was extended by introducing the observed, post-translational regulation of the TOC1 protein by the F-box protein ZEITLUPE (ZTL) and stabilization of ZTL by its interaction with GI in the presence of light (Kim et al, 2007). GI's function in the clock model has thus been revised according to the data: GI promotes an inhibition of TOC1 protein function through positive regulation of ZTL. This results, together with negative regulation of Y by TOC1, in indirect activation of TOC1 mRNA expression by GI, which agrees with our earlier experimental data (Locke et al, 2006). Simulations showed that the post-translational regulation of TOC1 by ZTL and GI results in the observed long period of the ztl mutant and fast dampening of rhythms in the lhy/cca1/gi triple mutant.
This is the first mathematical model that incorporates the observed post-translational regulation into the genetic network of the Arabidopsis clock. In addition to specific, mechanistic insights, the model shows a generic advantage from the duplication of clock genes and their expression at different phases. Such clock gene duplications are observed in eukaryotes with larger genomes, such as the mouse. Analogous, functional duplication can be achieved by differential regulation of a single clock gene in distinct cells, as in Drosophila.
Circadian clocks generate 24-h rhythms that are entrained by the day/night cycle. Clock circuits include several light inputs and interlocked feedback loops, with complex dynamics. Multiple biological components can contribute to each part of the circuit in higher organisms. Mechanistic models with morning, evening and central feedback loops have provided a heuristic framework for the clock in plants, but were based on transcriptional control. Here, we model observed, post-transcriptional and post-translational regulation and constrain many parameter values based on experimental data. The model's feedback circuit is revised and now includes PSEUDO-RESPONSE REGULATOR 7 (PRR7) and ZEITLUPE. The revised model matches data in varying environments and mutants, and gains robustness to parameter variation. Our results suggest that the activation of important morning-expressed genes follows their release from a night inhibitor (NI). Experiments inspired by the new model support the predicted NI function and show that the PRR5 gene contributes to the NI. The multiple PRR genes of Arabidopsis uncouple events in the late night from light-driven responses in the day, increasing the flexibility of rhythmic regulation.
doi:10.1038/msb.2010.69
PMCID: PMC2964123  PMID: 20865009
Arabidopsis thaliana; biological clocks; circadian rhythms; mathematical model; systems biology
11.  Circadian Clocks Are Resounding in Peripheral Tissues 
PLoS Computational Biology  2006;2(3):e16.
Circadian rhythms are prevalent in most organisms. Even the smallest disturbances in the orchestration of circadian gene expression patterns among different tissues can result in functional asynchrony, at the organism level, and may to contribute to a wide range of physiologic disorders. It has been reported that as many as 5%–10% of transcribed genes in peripheral tissues follow a circadian expression pattern. We have conducted a comprehensive study of circadian gene expression on a large dataset representing three different peripheral tissues. The data have been produced in a large-scale microarray experiment covering replicate daily cycles in murine white and brown adipose tissues as well as in liver. We have applied three alternative algorithmic approaches to identify circadian oscillation in time series expression profiles. Analyses of our own data indicate that the expression of at least 7% to 21% of active genes in mouse liver, and in white and brown adipose tissues follow a daily oscillatory pattern. Indeed, analysis of data from other laboratories suggests that the percentage of genes with an oscillatory pattern may approach 50% in the liver. For the rest of the genes, oscillation appears to be obscured by stochastic noise. Our phase classification and computer simulation studies based on multiple datasets indicate no detectable boundary between oscillating and non-oscillating fractions of genes. We conclude that greater attention should be given to the potential influence of circadian mechanisms on any biological pathway related to metabolism and obesity.
Synopsis
The metabolism of living organisms changes over the twenty-four hour daily cycle in an oscillatory manner. This repeating pattern of “peak” and “trough” expression is known as a “circadian rhythm.” We now know that the body's internal clock is controlled by a discrete group of genes. These important regulators are found in many different organs of the body, and they control expression of many other genes in the body. Using mice as an experimental animal, Ptitsyn and colleagues looked at the overall pattern of gene expression in fat tissues and the liver using three different mathematical tests. They present data indicating that the majority of active genes fluctuate rhythmically over a twenty-four hour period. This work suggests that future studies should pay close attention to the influence of the circadian rhythm in obesity and in fat metabolism.
doi:10.1371/journal.pcbi.0020016
PMCID: PMC1391918  PMID: 16532060
12.  A recursively partitioned mixture model for clustering time-course gene expression data 
Translational cancer research  2014;3(3):217-232.
Background
Longitudinally collected gene expression data provides an opportunity to investigate the dynamic behavior of gene expression and is crucial for establishing causal links between changes on a molecular level and disease development and progression. In terms of the analysis of such data, clustering of subjects based on time-course expression data may improve our understanding of temporal expression patterns that result in disease phenotypes. Although there are numerous existing methods for clustering subjects using gene expression data, most are not suitable when expression measurements are repeatedly collected over a time-course.
Methods
We present a modified version of the recursively partitioned mixture model (RPMM) for clustering subjects based on longitudinally collected gene expression data. In the proposed time-course RPMM (TC-RPMM), subjects are clustered on the basis of their temporal profiles of gene expression using a mixture of mixed effects models framework. This framework captures changes in gene expression over time and models the autocorrelation between repeated gene expression measurements for the same subject. We assessed the performance of TC-RPMM using extensive simulation studies and a dataset from a multi-center research study of inflammation and response to injury (www.gluegrant.org), which consisted of time-course gene expression data for 140 subjects.
Results
Our simulation studies encompassed several different scenarios and were aimed at assessing the ability of TC-RPMM to correctly recover true class memberships when the expression trajectories that characterized those classes differed. Overall, our simulation studies revealed favorable performance of TC-RPMM compared to competing approaches, however clustering performance was observed to be highly dependent on the proportion of class discriminating genes used in clustering analysis. When applied to real epidemiologic data with repeated-measures, longitudinal gene expression measurements, TC-RPMM identified clusters that had strong biological and clinical significance.
Conclusions
Methods for clustering subjects based on temporal gene expression profiles is a high priority for molecular biology and bioinformatics research. Along these lines, the proposed TC-RPMM represents a promising new approach for analyzing time-course gene expression data.
doi:10.3978/j.issn.2218-676X.2014.06.04
PMCID: PMC4208690  PMID: 25346887
Longitudinal gene expression data; repeated-measures microarrays; time-course microarrays; clustering; mixture models
13.  A method to identify differential expression profiles of time-course gene data with Fourier transformation 
BMC Bioinformatics  2013;14:310.
Background
Time course gene expression experiments are an increasingly popular method for exploring biological processes. Temporal gene expression profiles provide an important characterization of gene function, as biological systems are both developmental and dynamic. With such data it is possible to study gene expression changes over time and thereby to detect differential genes. Much of the early work on analyzing time series expression data relied on methods developed originally for static data and thus there is a need for improved methodology. Since time series expression is a temporal process, its unique features such as autocorrelation between successive points should be incorporated into the analysis.
Results
This work aims to identify genes that show different gene expression profiles across time. We propose a statistical procedure to discover gene groups with similar profiles using a nonparametric representation that accounts for the autocorrelation in the data. In particular, we first represent each profile in terms of a Fourier basis, and then we screen out genes that are not differentially expressed based on the Fourier coefficients. Finally, we cluster the remaining gene profiles using a model-based approach in the Fourier domain. We evaluate the screening results in terms of sensitivity, specificity, FDR and FNR, compare with the Gaussian process regression screening in a simulation study and illustrate the results by application to yeast cell-cycle microarray expression data with alpha-factor synchronization.
The key elements of the proposed methodology: (i) representation of gene profiles in the Fourier domain; (ii) automatic screening of genes based on the Fourier coefficients and taking into account autocorrelation in the data, while controlling the false discovery rate (FDR); (iii) model-based clustering of the remaining gene profiles.
Conclusions
Using this method, we identified a set of cell-cycle-regulated time-course yeast genes. The proposed method is general and can be potentially used to identify genes which have the same patterns or biological processes, and help facing the present and forthcoming challenges of data analysis in functional genomics.
doi:10.1186/1471-2105-14-310
PMCID: PMC4015127  PMID: 24134721
14.  Inference of Gene Regulatory Networks Incorporating Multi-Source Biological Knowledge via a State Space Model with L1 Regularization 
PLoS ONE  2014;9(8):e105942.
Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in the field of systems biology. Currently, there are two main approaches in GRN analysis using time-course observation data, namely an ordinary differential equation (ODE)-based approach and a statistical model-based approach. The ODE-based approach can generate complex dynamics of GRNs according to biologically validated nonlinear models. However, it cannot be applied to ten or more genes to simultaneously estimate system dynamics and regulatory relationships due to the computational difficulties. The statistical model-based approach uses highly abstract models to simply describe biological systems and to infer relationships among several hundreds of genes from the data. However, the high abstraction generates false regulations that are not permitted biologically. Thus, when dealing with several tens of genes of which the relationships are partially known, a method that can infer regulatory relationships based on a model with low abstraction and that can emulate the dynamics of ODE-based models while incorporating prior knowledge is urgently required. To accomplish this, we propose a method for inference of GRNs using a state space representation of a vector auto-regressive (VAR) model with L1 regularization. This method can estimate the dynamic behavior of genes based on linear time-series modeling constructed from an ODE-based model and can infer the regulatory structure among several tens of genes maximizing prediction ability for the observational data. Furthermore, the method is capable of incorporating various types of existing biological knowledge, e.g., drug kinetics and literature-recorded pathways. The effectiveness of the proposed method is shown through a comparison of simulation studies with several previous methods. For an application example, we evaluated mRNA expression profiles over time upon corticosteroid stimulation in rats, thus incorporating corticosteroid kinetics/dynamics, literature-recorded pathways and transcription factor (TF) information.
doi:10.1371/journal.pone.0105942
PMCID: PMC4146587  PMID: 25162401
15.  Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts 
PLoS Computational Biology  2009;5(7):e1000450.
The recently proposed concept of molecular connectivity maps enables researchers to integrate experimental measurements of genes, proteins, metabolites, and drug compounds under similar biological conditions. The study of these maps provides opportunities for future toxicogenomics and drug discovery applications. We developed a computational framework to build disease-specific drug-protein connectivity maps. We integrated gene/protein and drug connectivity information based on protein interaction networks and literature mining, without requiring gene expression profile information derived from drug perturbation experiments on disease samples. We described the development and application of this computational framework using Alzheimer's Disease (AD) as a primary example in three steps. First, molecular interaction networks were incorporated to reduce bias and improve relevance of AD seed proteins. Second, PubMed abstracts were used to retrieve enriched drug terms that are indirectly associated with AD through molecular mechanistic studies. Third and lastly, a comprehensive AD connectivity map was created by relating enriched drugs and related proteins in literature. We showed that this molecular connectivity map development approach outperformed both curated drug target databases and conventional information retrieval systems. Our initial explorations of the AD connectivity map yielded a new hypothesis that diltiazem and quinidine may be investigated as candidate drugs for AD treatment. Molecular connectivity maps derived computationally can help study molecular signature differences between different classes of drugs in specific disease contexts. To achieve overall good data coverage and quality, a series of statistical methods have been developed to overcome high levels of data noise in biological networks and literature mining results. Further development of computational molecular connectivity maps to cover major disease areas will likely set up a new model for drug development, in which therapeutic/toxicological profiles of candidate drugs can be checked computationally before costly clinical trials begin.
Author Summary
Molecular connectivity maps between drugs and a wide range of bio-molecular entities can help researchers to study and compare the molecular therapeutic/toxicological profiles of many candidate drugs. Recent studies in this area have focused on linking drug molecules and genes in specific disease contexts using drug-perturbed gene expression experiments, which can be costly and time-consuming to derive. In this paper, we developed a computational framework to build disease-specific drug-protein connectivity maps, by mining molecular interaction networks and PubMed abstracts. Using Alzheimer's Disease (AD) as a case study, we described how drug-protein molecular connectivity maps can be constructed to overcome data coverage and noise issues inherent in automatically extracted results. We showed that this new approach outperformed both curated drug target databases and conventional text mining systems in retrieving disease-related drugs, with an overall balanced performance of sensitivity, specificity, and positive predictive values. The AD molecular connectivity map contained novel information on AD-related genes/proteins, AD candidate drugs, and protein therapeutic/toxicological profiles of all the AD candidate drugs. Bi-clustering of the molecular connectivity map revealed interesting patterns of functionally similar proteins and drugs, therefore creating new opportunities for future drug development applications.
doi:10.1371/journal.pcbi.1000450
PMCID: PMC2709445  PMID: 19649302
16.  Identifying regulational alterations in gene regulatory networks by state space representation of vector autoregressive models and variational annealing 
BMC Genomics  2012;13(Suppl 1):S6.
Background
In the analysis of effects by cell treatment such as drug dosing, identifying changes on gene network structures between normal and treated cells is a key task. A possible way for identifying the changes is to compare structures of networks estimated from data on normal and treated cells separately. However, this approach usually fails to estimate accurate gene networks due to the limited length of time series data and measurement noise. Thus, approaches that identify changes on regulations by using time series data on both conditions in an efficient manner are demanded.
Methods
We propose a new statistical approach that is based on the state space representation of the vector autoregressive model and estimates gene networks on two different conditions in order to identify changes on regulations between the conditions. In the mathematical model of our approach, hidden binary variables are newly introduced to indicate the presence of regulations on each condition. The use of the hidden binary variables enables an efficient data usage; data on both conditions are used for commonly existing regulations, while for condition specific regulations corresponding data are only applied. Also, the similarity of networks on two conditions is automatically considered from the design of the potential function for the hidden binary variables. For the estimation of the hidden binary variables, we derive a new variational annealing method that searches the configuration of the binary variables maximizing the marginal likelihood.
Results
For the performance evaluation, we use time series data from two topologically similar synthetic networks, and confirm that our proposed approach estimates commonly existing regulations as well as changes on regulations with higher coverage and precision than other existing approaches in almost all the experimental settings. For a real data application, our proposed approach is applied to time series data from normal Human lung cells and Human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug termed Gefitinib. In the treated lung cells, a cancer cell condition is simulated by the stimulation of EGF-receptors, but the effect would be counteracted due to the selective inhibition of EGF-receptors by Gefitinib. However, gene expression profiles are actually different between the conditions, and the genes related to the identified changes are considered as possible off-targets of Gefitinib.
Conclusions
From the synthetically generated time series data, our proposed approach can identify changes on regulations more accurately than existing methods. By applying the proposed approach to the time series data on normal and treated Human lung cells, candidates of off-target genes of Gefitinib are found. According to the published clinical information, one of the genes can be related to a factor of interstitial pneumonia, which is known as a side effect of Gefitinib.
doi:10.1186/1471-2164-13-S1-S6
PMCID: PMC3587380  PMID: 22369122
17.  Transcriptional Regulation of Lineage Commitment - A Stochastic Model of Cell Fate Decisions 
PLoS Computational Biology  2013;9(8):e1003197.
Molecular mechanisms employed by individual multipotent cells at the point of lineage commitment remain largely uncharacterized. Current paradigms span from instructive to noise-driven mechanisms. Of considerable interest is also whether commitment involves a limited set of genes or the entire transcriptional program, and to what extent gene expression configures multiple trajectories into commitment. Importantly, the transient nature of the commitment transition confounds the experimental capture of committing cells. We develop a computational framework that simulates stochastic commitment events, and affords mechanistic exploration of the fate transition. We use a combined modeling approach guided by gene expression classifier methods that infers a time-series of stochastic commitment events from experimental growth characteristics and gene expression profiling of individual hematopoietic cells captured immediately before and after commitment. We define putative regulators of commitment and probabilistic rules of transition through machine learning methods, and employ clustering and correlation analyses to interrogate gene regulatory interactions in multipotent cells. Against this background, we develop a Monte Carlo time-series stochastic model of transcription where the parameters governing promoter status, mRNA production and mRNA decay in multipotent cells are fitted to experimental static gene expression distributions. Monte Carlo time is converted to physical time using cell culture kinetic data. Probability of commitment in time is a function of gene expression as defined by a logistic regression model obtained from experimental single-cell expression data. Our approach should be applicable to similar differentiating systems where single cell data is available. Within our system, we identify robust model solutions for the multipotent population within physiologically reasonable values and explore model predictions with regard to molecular scenarios of entry into commitment. The model suggests distinct dependencies of different commitment-associated genes on mRNA dynamics and promoter activity, which globally influence the probability of lineage commitment.
Author Summary
Stem cells have the capacity to both self-renew and differentiate into specialized cell lineages, thus sustaining tissue formation during embryonic development and permitting tissue homeostasis throughout adult life. Previous studies have suggested that stem cell commitment to a specific lineage may constitute a discrete event of stochastic activation of a small number of key regulator genes. Experimental exploration of this question is challenging, in face of the elusive nature of the commitment transition and due to considerable gene expression heterogeneity between cells. Here, we implement a computational model that simulates gene expression variation through time and affords the capture of in silico commitment events. This model integrates statistical analysis of experimental single-cell gene expression data with dynamical modeling methods to implement a mechanistic framework for stochastic regulation of gene transcription and a probabilistic approach for the commitment rules. Applied to blood cells, our method identifies potential commitment-associated genes, explores how their expression patterns can define alternative commitment regimes, and suggests how differences in regulation of gene expression dynamics can impact the frequency of commitment.
doi:10.1371/journal.pcbi.1003197
PMCID: PMC3749951  PMID: 23990771
18.  Division of labor by dual feedback regulators controls JAK2/STAT5 signaling over broad ligand range 
Quantitative analysis of time-resolved data in primary erythroid progenitor cells reveals that a dual negative transcriptional feedback mechanism underlies the ability of STAT5 to respond to the broad spectrum of physiologically relevant Epo concentrations.
A mathematical dual feedback model of the Epo-induced JAK2/STAT5 signaling pathway was calibrated with extensive time-resolved quantitative data sets from immunoblotting, mass spectrometry and qRT–PCR experiments in primary erythroid progenitor cells.We show that the amount of nuclear phosphorylated STAT5 integrated for 60 min post Epo stimulation directly correlates with the fraction of surviving cells 24 h later.CIS and SOCS3 were identified as the most relevant transcriptional feedback regulators of JAK2/STAT5 signaling in primary erythroid progenitor cells. Applying the model, we revealed that CIS-mediated inhibitory effects are most important at low ligand concentrations, whereas SOCS3 inhibition is more effective at high ligand doses.The distinct modes of inhibition of CIS and SOCS3 at various Epo concentrations provide a strategy for achieving control of JAK2/STAT5 signaling over the entire range of physiological Epo concentrations.
Cells interpret information encoded by extracellular stimuli through the activation of intracellular signaling networks and translate this information into cellular decisions. A prime example for a system that is exposed to extremely variable ligand concentrations is the erythroid lineage. The key regulator Erythropoietin (Epo) facilitates continuous renewal of erythrocytes at low basal levels but also secures compensation in case of, e.g., blood loss through an up to 1000-fold increase in hormone concentration. The Epo receptor (EpoR) is expressed on erythroid progenitor cells at the colony forming unit erythroid (CFU-E) stage. Stimulation of these cells with Epo leads to rapid but transient activation of receptor and JAK2 phosphorylation followed by phosphorylation of the latent transcription factor STAT5. Although STAT5 is known to be an essential regulator of survival and differentiation of erythroid progenitor cells, a quantitative link between the dynamic properties of STAT5 signaling and survival decisions remained unknown. STAT5-mediated responses in CFU-E cells are modulated by multiple attenuation mechanisms that operate on different time scales. Fast-acting mechanisms such as depletion of Epo by rapid receptor turnover and recruitment of the phosphatase SHP-1 control the initial signal amplitude at the receptor level. Transcriptional feedback regulators such as suppressor of cytokine signaling (SOCS) family members CIS and SOCS3 operate at a slower time scale. Despite the ample knowledge of the individual components involved, only little is known about the specific contributions of these regulators in controlling dynamic properties of STAT5 in response to a broad range of input signals. Therefore, dynamic pathway modeling is required to understand the complex regulatory network of feedback regulators.
To address these questions, we established a dual negative feedback model of JAK2/STAT5 signaling in primary erythroid progenitor cells isolated from mouse fetal livers. We provide a large data set of JAK2/STAT5 signaling dynamics employing quantitative immunoblotting, mass spectrometry and quantitative RT–PCR measured under different perturbation conditions to calibrate our model (Figure 3). The structure of our model was constructed to comprise the minimal number of parameters necessary to explain the data. Thereby, we aimed at a model with fully identifiable parameters that are essential to obtain high predictive power. Parameter identifiability was analyzed by the profile likelihood approach. Applying this method, we could establish a dual negative feedback model of JAK2-STAT5 signaling with structurally and in most cases practically identifiable parameters.
A major bottle-neck in combining signal transduction events with cellular phenotypes is the discrepancy in the time scale and stimuli concentrations that are applied in the different experiments. The sensitivity of biochemical assays to determine phosphorylation events within minutes or hours after stimulation is usually lower than the threshold of sensitivity in assays to determine the physiological response after one or more days. Facilitated by the model, we were able to compute the integrated response of JAK2/STAT5 signaling components for experimentally unaddressable Epo concentrations. Our results demonstrate that the integrated response of pSTAT5 in the nucleus accurately correlates with the experimentally determined survival of CFU-E cells. This provides a quantitative link of the dependency of primary CFU-E cells on pSTAT5 activation dynamics. By correlation analysis, we could identify the early signaling phase (⩽1 h) of STAT5 to be the most predictive for the fraction of surviving cells, which was determined ∼24 h later. Thus, we hypothesize that as a general principle in apoptotic decisions, ligand concentrations translated into kinetic-encoded information of early signaling events downstream of receptors can be predictive for survival decisions 24 h later.
After the first hour of stimulation, it is important to constrain signaling to a residual steady-state level. Constitutive phosphorylation of the JAK2/STAT5 pathway has a crucial role in the onset of polycythemia vera (PV), a disease associated with Epo-independent erythroid differentiation. The two identified transcriptional feedback proteins, CIS and SOCS3, are responsible for adjusting the phosphorylation level of STAT5 after 1 h of stimulation. Since the Epo input signal can vary over a broad range of ligand concentrations, we asked how CIS and SOCS3 can facilitate control of STAT5 long-term phosphorylation levels over the entire physiological relevant hormone concentrations. By using model simulations, we revealed that the two feedbacks are most effective at different Epo concentration ranges. Predicted by our mathematical model, the major role of CIS in modulating STAT5 phosphorylation levels is at low, basal Epo concentrations, whereas SOCS3 is essential to control the STAT5 phosphorylation levels at high Epo doses (Figure 6). As a potential molecular mechanism of this dose-dependent inhibitory effect, we could identify the quantity of pJAK2 relative to pEpoR that increases with higher Epo concentrations. Since SOCS3 can inhibit JAK2 directly via its KIR domain to attenuate downstream STAT5 activation, SOCS3 becomes more effective with the relative increase of JAK2 activation. Hence, CIS and SOCS3 act in a concerted manner to ensure tight regulation of STAT5 responses over the broad physiological range of Epo concentrations.
In summary, our mathematical approach provided new insights into the specific function of feedback regulation in STAT5-mediated life or death decisions of primary erythroid cells. We dissected the roles of the transcriptionally induced proteins CIS and SOCS3 that operate as dual feedback with divided function thereby facilitating the control of STAT5 activation levels over the entire range of physiological Epo concentrations. The detailed understanding of the molecular processes and control distribution of Epo-induced JAK/STAT signaling can be further applied to gain insights into alterations promoting malignant hematopoietic diseases.
Cellular signal transduction is governed by multiple feedback mechanisms to elicit robust cellular decisions. The specific contributions of individual feedback regulators, however, remain unclear. Based on extensive time-resolved data sets in primary erythroid progenitor cells, we established a dynamic pathway model to dissect the roles of the two transcriptional negative feedback regulators of the suppressor of cytokine signaling (SOCS) family, CIS and SOCS3, in JAK2/STAT5 signaling. Facilitated by the model, we calculated the STAT5 response for experimentally unobservable Epo concentrations and provide a quantitative link between cell survival and the integrated response of STAT5 in the nucleus. Model predictions show that the two feedbacks CIS and SOCS3 are most effective at different ligand concentration ranges due to their distinct inhibitory mechanisms. This divided function of dual feedback regulation enables control of STAT5 responses for Epo concentrations that can vary 1000-fold in vivo. Our modeling approach reveals dose-dependent feedback control as key property to regulate STAT5-mediated survival decisions over a broad range of ligand concentrations.
doi:10.1038/msb.2011.50
PMCID: PMC3159971  PMID: 21772264
apoptosis; erythropoietin; mathematical modeling; negative feedback; SOCS
19.  Clustering of time-course gene expression profiles using normal mixture models with autoregressive random effects 
BMC Bioinformatics  2012;13:300.
Background
Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expressions have been found not to be sufficiently adequate to model the complexity of the time-course data, partly due to their ignoring the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Some simulations and real examples are given to demonstrate the usefulness of the proposed models.
Results
We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models to provide more reliable and robust clustering of time-course data. Our model provides superior results when genetic profiles are correlated. It also gives comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of coregulated genes are obtained, which are supported by gene-function annotation databases.
Conclusions
Our new model under our extension of the EMMIX-WIRE procedure is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters that enables improved modelling and consequent clustering of time-course data.
doi:10.1186/1471-2105-13-300
PMCID: PMC3574839  PMID: 23151154
Time-course data; Mixtures of linear mixed models; Autoregressive random effects; EMMIX-WIRE procedure
20.  Sparse multitask regression for identifying common mechanism of response to therapeutic targets 
Bioinformatics  2010;26(12):i97-i105.
Motivation: Molecular association of phenotypic responses is an important step in hypothesis generation and for initiating design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently with a phenotypic index and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can potentially hide new insights into common mechanism of response.
Results: In this article, we propose a sparse, multitask regression model together with co-clustering analysis to explore the intrinsic grouping in associating the gene expression with phenotypic signatures. The global structure of association is captured by learning an intrinsic template that is shared among experimental conditions, with local perturbations introduced to integrate effects of therapeutic agents. We demonstrate the performance of our approach on both synthetic and experimental data. Synthetic data reveal that the multi-task regression has a superior reduction in the regression error when compared with traditional L1-and L2-regularized regression. On the other hand, experiments with cell cycle inhibitors over a panel of 14 breast cancer cell lines demonstrate the relevance of the computed molecular predictors with the cell cycle machinery, as well as the identification of hidden variables that are not captured by the baseline regression analysis. Accordingly, the system has identified CLCA2 as a hidden transcript and as a common mechanism of response for two therapeutic agents of CI-1040 and Iressa, which are currently in clinical use.
Contact: b_parvin@lbl.gov
doi:10.1093/bioinformatics/btq181
PMCID: PMC2881366  PMID: 20529943
21.  Microarray data can predict diurnal changes of starch content in the picoalga Ostreococcus 
BMC Systems Biology  2011;5:36.
Background
The storage of photosynthetic carbohydrate products such as starch is subject to complex regulation, effected at both transcriptional and post-translational levels. The relevant genes in plants show pronounced daily regulation. Their temporal RNA expression profiles, however, do not predict the dynamics of metabolite levels, due to the divergence of enzyme activity from the RNA profiles.
Unicellular phytoplankton retains the complexity of plant carbohydrate metabolism, and recent transcriptomic profiling suggests a major input of transcriptional regulation.
Results
We used a quasi-steady-state, constraint-based modelling approach to infer the dynamics of starch content during the 12 h light/12 h dark cycle in the model alga Ostreococcus tauri. Measured RNA expression datasets from microarray analysis were integrated with a detailed stoichiometric reconstruction of starch metabolism in O. tauri in order to predict the optimal flux distribution and the dynamics of the starch content in the light/dark cycle. The predicted starch profile was validated by experimental data over the 24 h cycle. The main genetic regulatory targets within the pathway were predicted by in silico analysis.
Conclusions
A single-reaction description of starch production is not able to account for the observed variability of diurnal activity profiles of starch-related enzymes. We developed a detailed reaction model of starch metabolism, which, to our knowledge, is the first attempt to describe this polysaccharide polymerization while preserving the mass balance relationships. Our model and method demonstrate the utility of a quasi-steady-state approach for inferring dynamic metabolic information in O. tauri directly from time-series gene expression data.
doi:10.1186/1752-0509-5-36
PMCID: PMC3056741  PMID: 21352558
22.  Multi-profile Bayesian alignment model for LC-MS data analysis with integration of internal standards 
Bioinformatics  2013;29(21):2774-2780.
Motivation: Liquid chromatography-mass spectrometry (LC-MS) has been widely used for profiling expression levels of biomolecules in various ‘-omic’ studies including proteomics, metabolomics and glycomics. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time (RT) alignment, which is required to ensure that ion intensity measurements among multiple LC-MS runs are comparable, is one of the most important yet challenging preprocessing steps. Current alignment approaches estimate RT variability using either single chromatograms or detected peaks, but do not simultaneously take into account the complementary information embedded in the entire LC-MS data.
Results: We propose a Bayesian alignment model for LC-MS data analysis. The alignment model provides estimates of the RT variability along with uncertainty measures. The model enables integration of multiple sources of information including internal standards and clustered chromatograms in a mathematically rigorous framework. We apply the model to LC-MS metabolomic, proteomic and glycomic data. The performance of the model is evaluated based on ground-truth data, by measuring correlation of variation, RT difference across runs and peak-matching performance. We demonstrate that Bayesian alignment model improves significantly the RT alignment performance through appropriate integration of relevant information.
Availability and implementation: MATLAB code, raw and preprocessed LC-MS data are available at http://omics.georgetown.edu/alignLCMS.html
Contact: hwr@georgetown.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt461
PMCID: PMC3799465  PMID: 24013927
23.  An integrated machine learning approach for predicting DosR-regulated genes in Mycobacterium tuberculosis 
BMC Systems Biology  2010;4:37.
Background
DosR is an important regulator of the response to stress such as limited oxygen availability in Mycobacterium tuberculosis. Time course gene expression data enable us to dissect this response on the gene regulatory level. The mRNA expression profile of a regulator, however, is not necessarily a direct reflection of its activity. Knowing the transcription factor activity (TFA) can be exploited to predict novel target genes regulated by the same transcription factor. Various approaches have been proposed to reconstruct TFAs from gene expression data. Most of them capture only a first-order approximation to the complex transcriptional processes by assuming linear gene responses and linear dynamics in TFA, or ignore the temporal information in data from such systems.
Results
In this paper, we approach the problem of inferring dynamic hidden TFAs using Gaussian processes (GP). We are able to model dynamic TFAs and to account for both linear and nonlinear gene responses. To test the validity of the proposed approach, we reconstruct the hidden TFA of p53, a tumour suppressor activated by DNA damage, using published time course gene expression data. Our reconstructed TFA is closer to the experimentally determined profile of p53 concentration than that from the original study. We then apply the model to time course gene expression data obtained from chemostat cultures of M. tuberculosis under reduced oxygen availability. After estimation of the TFA of DosR based on a number of known target genes using the GP model, we predict novel DosR-regulated genes: the parameters of the model are interpreted as relevance parameters indicating an existing functional relationship between TFA and gene expression. We further improve the prediction by integrating promoter sequence information in a logistic regression model. Apart from the documented DosR-regulated genes, our prediction yields ten novel genes under direct control of DosR.
Conclusions
Chemostat cultures are an ideal experimental system for controlling noise and variability when monitoring the response of bacterial organisms such as M. tuberculosis to finely controlled changes in culture conditions and available metabolites. Nonlinear hidden TFA dynamics of regulators can be reconstructed remarkably well with Gaussian processes from such data. Moreover, estimated parameters of the GP can be used to assess whether a gene is controlled by the reconstructed TFA or not. It is straightforward to combine these parameters with further information, such as the presence of binding motifs, to increase prediction accuracy.
doi:10.1186/1752-0509-4-37
PMCID: PMC2867773  PMID: 20356371
24.  svdPPCS: an effective singular value decomposition-based method for conserved and divergent co-expression gene module identification 
BMC Bioinformatics  2010;11:338.
Background
Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods.
Results
Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and malpighian tubule (MT) tissues of the line W118 of M. drosophila. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these models ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering.
Conclusions
svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.
doi:10.1186/1471-2105-11-338
PMCID: PMC2905369  PMID: 20565989
25.  Assessing and selecting gene expression signals based upon the quality of the measured dynamics 
BMC Bioinformatics  2009;10:55.
Background
One of the challenges with modeling the temporal progression of biological signals is dealing with the effect of noise and the limited number of replicates at each time point. Given the rising interest in utilizing predictive mathematical models to describe the biological response of an organism or analysis such as clustering and gene ontology enrichment, it is important to determine whether the dynamic progression of the data has been accurately captured despite the limited number of replicates, such that one can have confidence that the results of the analysis are capturing important salient dynamic features.
Results
By pre-selecting genes based upon quality before the identification of differential expression via algorithm such as EDGE, it was found that the percentage of statistically enriched ontologies (p < .05) was improved. Furthermore, it was found that a majority of the genes found via the proposed technique were also selected via an EDGE selection though the reverse was not necessarily true. It was also found that improvements offered by the proposed algorithm are anti-correlated with improvements in the various microarray platforms and the number of replicates. This is illustrated by the fact that newer arrays and experiments with more replicates show less improvement when the filtering for quality is first run before the selection of differentially expressed genes. This suggests that the increase in the number of replicates as well as improvements in array technologies are increase the confidence one has in the dynamics obtained from the experiment.
Conclusion
We have developed an algorithm that quantifies the quality of temporal biological signal rather than whether the signal illustrates a significant change over the experimental time course. Because the use of these temporal signals, whether it is in mathematical modeling or clustering, focuses upon the entire time series, it is necessary to develop a method to quantify and select for signals which conform to this ideal. By doing this, we have demonstrated a marked and consistent improvement in the results of a clustering exercise over multiple experiments, microarray platforms, and experimental designs.
doi:10.1186/1471-2105-10-55
PMCID: PMC2653486  PMID: 19208252

Results 1-25 (1441262)