Zebrafish embryos have recently been established as a xenotransplantation model of the metastatic behaviour of primary human tumours. Current tools for automated data extraction from the microscope images are restrictive concerning the developmental stage of the embryos, usually require laborious manual image preprocessing, and, in general, cannot characterize the metastasis as a function of the internal organs.
We present a tool, ZebIAT, that allows either automatic or semi-automatic registration of the outer contour and inner organs of zebrafish embryos. ZebIAT provides registration at different stages of development and an automatic analysis of cancer metastasis per organ, thus enabling the study of cancer progression. The semi-automatic mode relies on a graphical user interface.
We quantified the performance of the registration method and found it to be accurate, except for some of the smallest organs. Our results show that the accuracy of registering small organs can be improved by introducing a few manual corrections. We also demonstrate the applicability of the tool to studies of cancer progression.
ZebIAT offers a major improvement over previous tools by allowing analysis on a per-organ or per-region basis. It should be of use in high-throughput studies of cancer metastasis in zebrafish embryos.
Cell imaging is becoming an indispensable tool for cell and molecular biology research. However, most processes studied are stochastic in nature, and require the observation of many cells and events. Ideally, extraction of information from these images ought to rely on automatic methods. Here, we propose a novel segmentation method, MAMLE, for detecting cells within dense clusters.
MAMLE executes cell segmentation in two stages. The first relies on a state-of-the-art filtering technique, multi-resolution edge detection with a morphological operator, and threshold decomposition for adaptive thresholding. In the second stage, a correction procedure is applied to this result that uses a maximum likelihood estimate as its objective function. The procedure acquires morphological features from the initial segmentation to construct the likelihood parameter, after which the final segmentation is obtained.
We performed an empirical evaluation that includes sample images from different imaging modalities and diverse cell types. The new method attained very high (above 90%) cell segmentation accuracy in all cases. Finally, its accuracy was compared to several existing methods, and in all tests, MAMLE outperformed them in segmentation accuracy.
High-throughput genome-wide screening to study gene-specific functions, e.g. for drug discovery, demands fast automated image analysis methods to assist in unraveling the full potential of such studies. Image segmentation is typically at the forefront of such analysis as the performance of the subsequent steps, for example, cell classification, cell tracking etc., often relies on the results of segmentation.
We present a cell cytoplasm segmentation framework which first separates cell cytoplasm from image background using a novel approach of image enhancement and the coefficient of variation of a multi-scale Gaussian scale-space representation. A novel outline-learning based classification method is developed using regularized logistic regression with embedded feature selection, which classifies image pixels as outline/non-outline to give cytoplasm outlines. Refinement of the detected outlines to separate cells from each other is performed in a post-processing step where the nuclei segmentation is used as contextual information.
Results and conclusions
We evaluate the proposed segmentation methodology using two challenging test cases, presenting images with completely different characteristics, with cells of varying size, shape, texture and degrees of overlap. The feature selection and classification framework for outline detection produces very simple sparse models which use only a small subset of the large, generic feature set, that is, only 7 and 5 features for the two cases. Quantitative comparison of the results for the two test cases against state-of-the-art methods shows that our methodology outperforms them with an increase of 4-9% in segmentation accuracy and a maximum accuracy of 93%. Finally, the results obtained for diverse datasets demonstrate that our framework not only produces accurate segmentation but also generalizes well to different segmentation tasks.
We explore whether the process of multimerization can be used as a means to regulate noise in the abundance of functional protein complexes. Additionally, we analyze how this process affects the mean level of these functional units, the response time of a gene, and the temporal correlation between the numbers of expressed proteins and of functional multimers. We show that, although multimerization increases noise by reducing the mean number of functional complexes, it can reduce noise in comparison with a monomer when the abundances of the functional proteins are comparable. Alternatively, reduction in noise occurs if both monomeric and multimeric forms of the protein are functional. Moreover, we find that multimerization either increases the response time to external signals or decreases the correlation between the number of functional complexes and protein production kinetics. Finally, we show that the results are in agreement with recent genome-wide assessments of cell-to-cell variability in protein numbers and of multimerization in essential and non-essential genes in Escherichia coli, and that the effects of multimerization are tangible at the level of genetic circuits.
Cancer is a broad group of genetic diseases that account for millions of deaths worldwide each year. Cancers are classified by various clinical, pathological and molecular methods, but even within a well-characterized disease, there is significant inter-patient variability in survival, response to treatment, and other parameters. At the molecular level in particular, tumours of the same category can appear significantly dissimilar due to complex combinations of genetic aberrations leading to a similar malignancy. We extended the current classification methods by studying tumour heterogeneity at the pathway level.
We computed the rate of alterations in 1994 pathways and 2210 tumours consisting of eight different cancers. Using gene set enrichment analysis, a pathway aberration profile reflecting the molecular state of the sample was computed for each sample. The profiles were analysed together to infer the characteristic aberration rates for each pathway within each cancer. Subgroups of tumours defined by similar pathway aberrations were identified using clustering analyses. The pathway aberration and gene expression profiles of the subgroups were subsequently compared across all eight cancer types to search for similar tumours crossing the standard classification.
We identified pathways and processes that were common to all cancers as well as traits that are unique to a cancer type or closely related cancers. Studying the gene expression patterns within the pathway context suggested potential alteration mechanisms. Clustering analysis revealed five clinically relevant subgroups of tumours in four cancers that exhibited significant differences in survival compared to others. The cross-cancer analysis of the subgroups resulted in the identification of tumours that shared potentially significant alterations.
This study represents the first effort to extend the molecular characterizations towards pathway level descriptions across the family of cancers. In addition to providing a proof-of-concept for single sample pathway aberration analysis in this context, we present a comprehensive pathway aberration dataset that can be used to study pathway aberration patterns within or across cancers. Significant similarities between subgroups of different cancers on pathway and gene expression levels provide interesting hypotheses for understanding variable drug response, or transferring treatments across diseases by identifying common druggable pathways or genes, for example.
The behavior of genetic motifs is determined not only by the gene-gene interactions, but also by the expression patterns of the constituent genes. Live single-molecule measurements have provided evidence that transcription initiation is a sequential process, whose kinetics plays a key role in the dynamics of mRNA and protein numbers. The extent to which it affects the behavior of cellular motifs is unknown. Here, we examine how the kinetics of transcription initiation affects the behavior of motifs performing filtering in amplitude and frequency domain. We find that the performance of each filter is degraded as transcript levels are lowered. This effect can be reduced by having a transcription process with more steps. In addition, we show that the kinetics of the stepwise transcription initiation process affects features such as filter cutoffs. These results constitute an assessment of the range of behaviors of genetic motifs as a function of the kinetics of transcription initiation, and thus will aid in tuning of synthetic motifs to attain specific characteristics without affecting their protein products.
The potential impact of nanoparticles on the environment and on human health has attracted considerable interest worldwide. The amount of transcriptomics data, in which tissues and cell lines are exposed to nanoparticles, increases year by year. In addition to the importance of the original findings, this data can have value in a broader context when combined with other previously acquired and published results. In order to facilitate the efficient usage of the data, we have developed the NanoMiner web resource (http://nanominer.cs.tut.fi/), which contains 404 human transcriptome samples exposed to various types of nanoparticles. All the samples in NanoMiner have been annotated, preprocessed and normalized using standard methods that ensure the quality of the data analyses and enable the users to utilize the database systematically across the different experimental setups and platforms. With NanoMiner it is possible to 1) search and plot the expression profiles of one or several genes of interest, 2) cluster the samples within the datasets, 3) find differentially expressed genes in various nanoparticle studies, 4) detect the nanoparticles causing differential expression of selected genes, 5) analyze enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms for the detected genes and 6) search the expression values and differential expressions of the genes belonging to a specific KEGG pathway or GO term. In sum, the NanoMiner database is a valuable collection of microarray data that can also be used as a data repository for future analyses.
Using a single-RNA detection technique in live Escherichia coli cells, we measure, for each cell, the waiting time for the production of the first RNA under the control of the PBAD promoter after induction by arabinose, and the subsequent intervals between transcription events. We find that the kinetics of the arabinose intake system affect the mean and diversity in RNA numbers long after induction. We observed the same effect with the Plac/ara-1 promoter, which is inducible by arabinose or by IPTG. Importantly, the distribution of waiting times of Plac/ara-1 is indistinguishable from that of PBAD if and only if it is induced by arabinose alone. Finally, RNA production under the control of PBAD is found to be a sub-Poissonian process. We conclude that inducer-dependent waiting times affect the mean and cell-to-cell diversity in RNA numbers long after induction, suggesting that intake mechanisms have non-negligible effects on the phenotypic diversity of cell populations in natural, fluctuating environments.
Fusion genes are chromosomal aberrations that are found in many cancers and can be used as prognostic markers and drug targets in clinical practice. Fusions can lead to production of oncogenic fusion proteins or to enhanced expression of oncogenes. Several recent studies have reported that some fusion genes can escape microRNA regulation via 3′–untranslated region (3′-UTR) deletion. We performed whole transcriptome sequencing to identify fusion genes in glioma and discovered FGFR3-TACC3 fusions in 4 of 48 glioblastoma samples from patients of both mixed European and Asian descent, but not in any of 43 low-grade glioma samples tested. The fusion, caused by tandem duplication on 4p16.3, led to the loss of the 3′-UTR of FGFR3, blocking regulation of the gene by miR-99a and enhancing expression of the fusion gene. The fusion gene was mutually exclusive with EGFR, PDGFR, or MET amplification. Using cultured glioblastoma cells and a mouse xenograft model, we found that fusion protein expression promoted cell proliferation and tumor progression, while WT FGFR3 protein was not tumorigenic, even under forced overexpression. These results demonstrated that the FGFR3-TACC3 gene fusion is expressed in human cancer and generates an oncogenic protein that promotes tumorigenesis in glioblastoma.
Escherichia coli cells employ an asymmetric strategy at division, segregating unwanted substances to older poles, which has been associated with aging in these organisms. The kinetics of this process is still poorly understood. Using the MS2 coat protein fused to green fluorescent protein (GFP) and a reporter construct with multiple MS2 binding sites, we tracked individual RNA-MS2-GFP complexes in E. coli cells from the time when they were produced. Analyses of the kinetics and brightness of the spots showed that these spots appear in the midcell region, are composed of a single RNA-MS2-GFP complex, and reach a pole before another target RNA is formed, typically remaining there thereafter. The choice of pole is probabilistic and heavily biased toward one pole, similar to what was observed by previous studies regarding protein aggregates. Additionally, this mechanism was found to act independently on each disposed molecule. Finally, while the RNA-MS2-GFP complexes were disposed of, the MS2-GFP tagging molecules alone were not. We conclude that this asymmetric mechanism to segregate damage at the expense of aging individuals acts probabilistically on individual molecules and is capable of the accurate classification of molecules for disposal.
In Escherichia coli, tetracycline prevents translation. When subjected to tetracycline, E. coli expresses TetA to pump it out by a mechanism that is sensitive yet fairly independent of cellular metabolism. We constructed a target gene, PtetA-mRFP1-96BS, with a 96 MS2-GFP binding site array in a single-copy BAC vector, whose expression is controlled by the tetA promoter. We measured the in vivo kinetics of production of individual RNA molecules of the target gene as a function of inducer concentration and temperature. From the distributions of intervals between transcription events, we find that RNA production by PtetA is a sub-Poissonian process. Next, we infer the number and duration of the prominent sequential steps in transcription initiation by maximum likelihood estimation. Under full induction and at optimal temperature, we observe three major steps. We find that the kinetics of RNA production under the control of PtetA, including the number and duration of the steps, varies with induction strength and temperature. The results are supported by a set of logical pairwise Kolmogorov-Smirnov tests. We conclude that the expression of TetA is controlled by a sequential mechanism that is robust yet sensitive to external signals.
In Escherichia coli the mean and cell-to-cell diversity in RNA numbers of different genes vary widely. This is likely due to different kinetics of transcription initiation, a complex process with multiple rate-limiting steps that affect RNA production.
We measured the in vivo kinetics of production of individual RNA molecules under the control of the lar promoter in E. coli. From the analysis of the distributions of intervals between transcription events in the regimes of weak and medium induction, we find that the process of transcription initiation of this promoter involves a sequential mechanism with two main rate-limiting steps, each lasting hundreds of seconds. Both steps become faster with increasing induction by IPTG and arabinose.
The two rate-limiting steps in initiation are found to be important regulators of the dynamics of RNA production under the control of the lar promoter in the regimes of weak and medium induction. Variability in the intervals between consecutive RNA productions is much lower than it would be if there were only one rate-limiting step with a duration following an exponential distribution. The methodology proposed here to analyze the in vivo dynamics of transcription may be applicable at a genome-wide scale and provide valuable insight into the dynamics of prokaryotic genetic networks.
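The effect of a second rate-limiting step on variability can be illustrated with a small simulation. The sketch below compares the squared coefficient of variation (CV²) of intervals produced by one exponential step with that of two sequential exponential steps. The step durations (600 s, and 300 s + 300 s) and all function names are illustrative assumptions, not values or code from the study.

```python
import random
import statistics

def sample_intervals(step_means, n, rng):
    """Draw n intervals; each is a sum of exponential steps with the given mean durations (s)."""
    return [sum(rng.expovariate(1.0 / m) for m in step_means) for _ in range(n)]

def cv_squared(values):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = statistics.fmean(values)
    return statistics.pvariance(values) / mean ** 2

def theoretical_cv2(step_means):
    """CV^2 of a sum of independent exponential steps: sum(m_i^2) / (sum(m_i))^2."""
    return sum(m * m for m in step_means) / sum(step_means) ** 2

rng = random.Random(0)
one_step = sample_intervals([600.0], 20000, rng)          # hypothetical single step
two_step = sample_intervals([300.0, 300.0], 20000, rng)   # hypothetical two equal steps

print("one step : simulated", round(cv_squared(one_step), 2),
      "theory", theoretical_cv2([600.0]))
print("two steps: simulated", round(cv_squared(two_step), 2),
      "theory", theoretical_cv2([300.0, 300.0]))
```

With the same mean interval, the two-step process has half the CV² of the one-step process when the steps are equally long; the reduction is smaller when one step dominates.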
In preclinical studies, human adipose stem cells (ASCs) have been shown to have therapeutic applicability, but standard expansion methods for clinical applications have yet to be established. ASCs are typically expanded in medium containing fetal bovine serum (FBS). However, sera and other animal-derived culture reagents pose safety issues in clinical therapy, including possible infections and severe immune reactions. By expanding ASCs in medium containing human serum (HS), the problem can be eliminated. To define how allogeneic HS (alloHS) performs in ASC expansion compared to FBS, a comparative in vitro study in both serum supplements was performed. The choice of serum had a significant effect on ASCs. First, to reach cell proliferation levels comparable with 10% FBS, at least 15% alloHS was required. Second, while genes of the cell cycle pathway were overexpressed in alloHS, genes of the bone morphogenetic protein receptor–mediated signaling on the transforming growth factor beta signaling pathway regulating, for example, osteoblast differentiation, were overexpressed in FBS. The result was further supported by differentiation analysis, where early osteogenic differentiation was significantly enhanced in FBS. The data presented here underscore the importance of thorough investigation of ASCs for utilization in cell therapies. This study is a step forward in the understanding of these potential cells.
Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model.
Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries.
We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.
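The posterior probability of a methylation state can be sketched for a plain three-component beta mixture. This is a simplified illustration only: it ignores the dependencies between neighbouring probes that the model above includes, and the mixture weights and shape parameters are hypothetical values chosen for the example, not the estimated parameters.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed through log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

# Hypothetical components: (state name, mixture weight, alpha, beta).
COMPONENTS = [
    ("low", 0.5, 2.0, 12.0),
    ("medium", 0.2, 6.0, 6.0),
    ("high", 0.3, 12.0, 2.0),
]

def posterior(x, components=COMPONENTS):
    """Posterior probability of each state given a probe's methylation value x (Bayes' rule)."""
    weighted = {name: w * beta_pdf(x, a, b) for name, w, a, b in components}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}

for value in (0.05, 0.5, 0.95):
    post = posterior(value)
    best = max(post, key=post.get)
    print(value, best, {k: round(v, 3) for k, v in post.items()})
```

Each probe is assigned the state with the largest posterior, and that posterior value gives a direct measure of how confident the assignment is.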
Gene expression in Escherichia coli is regulated by several mechanisms. We measured in single cells the expression level of a single-copy gene coding for green fluorescent protein (GFP), integrated into the genome and driven by a tetracycline inducible promoter, for varying induction strengths. Also, we measured the transcriptional activity of a tetracycline inducible promoter controlling the transcription of an RNA with 96 binding sites for MS2-GFP.
The distribution of GFP levels in single cells is found to change significantly as induction reaches high levels, causing the Fano factor of the cells' protein levels to increase with mean level, beyond what would be expected from a Poisson-like process of RNA transcription. In agreement, the Fano factor of the cells' number of RNA molecules targeted by MS2-GFP follows a similar trend. The results provide evidence that the dynamics of the promoter complex formation, namely, the variability in its duration from one transcription event to the next, explains the change in the distribution of expression levels in the cell population with induction strength.
The results suggest that the open complex formation of the tetracycline inducible promoter, in the regime of strong induction, affects significantly the dynamics of RNA production due to the variability of its duration from one event to the next.
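For reference, the Fano factor used above is the variance of the molecule counts divided by their mean, and it equals 1 for a Poisson distribution. The sketch below computes it for two synthetic cell populations: one with Poisson-distributed counts and one in which the production level differs from cell to cell. The populations, rates, and function names are illustrative assumptions and do not reproduce the measurements.

```python
import math
import random
import statistics

def fano_factor(counts):
    """Fano factor: variance of the counts divided by their mean."""
    return statistics.pvariance(counts) / statistics.fmean(counts)

def poisson_sample(mean, rng):
    """Knuth's multiplication method for a Poisson-distributed integer (suitable for small means)."""
    threshold = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)

# Hypothetical population 1: every cell has the same mean count (Poisson, Fano near 1).
poisson_cells = [poisson_sample(5.0, rng) for _ in range(20000)]

# Hypothetical population 2: each cell has mean 2 or 8 with equal probability,
# which keeps the population mean at 5 but adds cell-to-cell variability (Fano above 1).
mixed_cells = [poisson_sample(rng.choice((2.0, 8.0)), rng) for _ in range(20000)]

print("Poisson population:", round(fano_factor(poisson_cells), 2))
print("Mixed population  :", round(fano_factor(mixed_cells), 2))
```

A Fano factor above 1 therefore signals variability beyond that of a simple Poisson process, which is the comparison made in the text.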
In prokaryotes, transcription and translation are dynamically coupled, as the latter starts before the former is complete. Also, from one transcript, several translation events occur in parallel. To study how events in transcription elongation affect translation elongation and fluctuations in protein levels, we propose a delayed stochastic model of prokaryotic transcription and translation at the nucleotide and codon level that includes the promoter open complex formation and alternative pathways to elongation, namely pausing, arrests, editing, pyrophosphorolysis, RNA polymerase traffic, and premature termination. Stepwise translation can start after the ribosome binding site is formed and accounts for variable codon translation rates, ribosome traffic, back-translocation, drop-off, and trans-translation.
First, we show that the model accurately matches measurements of sequence-dependent translation elongation dynamics. Next, we characterize the degree of coupling between fluctuations in RNA and protein levels, and its dependence on the rates of transcription and translation initiation. Finally, modeling sequence-specific transcriptional pauses, we find that these affect protein noise levels.
For parameter values within realistic intervals, transcription and translation are found to be tightly coupled in Escherichia coli, as the noise in protein levels is mostly determined by the underlying noise in RNA levels. Sequence-dependent events in transcription elongation, e.g. pauses, are found to cause tangible effects in the degree of fluctuations in protein levels.
Neuronal networks exhibit a wide diversity of structures, which contributes to the diversity of the dynamics therein. The presented work applies an information theoretic framework to simultaneously analyze structure and dynamics in neuronal networks. Information diversity within the structure and dynamics of a neuronal network is studied using the normalized compression distance. To describe the structure, a scheme for generating distance-dependent networks with identical in-degree distribution but variable strength of dependence on distance is presented. The resulting network structure classes possess differing path length and clustering coefficient distributions. In parallel, comparable realistic neuronal networks are generated with the NETMORPH simulator and a similar analysis is performed on them. To describe the dynamics, network spike trains are simulated using different network structures and their bursting behaviors are analyzed. For the simulation of the network activity the Izhikevich model of spiking neurons is used together with the Tsodyks model of dynamical synapses. We show that the structure of the simulated neuronal networks affects the spontaneous bursting activity when measured with bursting frequency and a set of intraburst measures: the more locally connected networks produce more and longer bursts than the more random networks. The information diversity of the structure of a network is greatest in the most locally connected networks, smallest in random networks, and somewhere in between in the networks between order and disorder. As for the dynamics, the most locally connected networks and some of the in-between networks produce the most complex intraburst spike trains. The same result also holds for the sparser of the two considered network densities in the case of full spike trains.
information diversity; neuronal network; structure-dynamics relationship; complexity
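The normalized compression distance mentioned above is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. The sketch below is a minimal illustration using zlib as the compressor; the choice of compressor, the byte-string encoding of the data, and the function names are assumptions, not details taken from the study.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def random_bytes(n: int, rng: random.Random) -> bytes:
    """Pseudo-random byte string standing in for an encoded structure or spike train."""
    return bytes(rng.getrandbits(8) for _ in range(n))

rng = random.Random(1)
a = random_bytes(2000, rng)
b = random_bytes(2000, rng)

print("NCD(a, a):", round(ncd(a, a), 3))  # close to 0: identical objects
print("NCD(a, b):", round(ncd(a, b), 3))  # close to 1: unrelated objects
```

Values near 0 indicate that one object is largely described by the other, while values near 1 indicate that the two share little compressible information.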
We propose a Markov chain approximation of the delayed stochastic simulation algorithm to infer properties of the mechanisms in prokaryote transcription from the dynamics of RNA levels. We model transcription using the delayed stochastic modelling strategy and realistic parameter values for rate of transcription initiation and RNA degradation. From the model, we generate time series of RNA levels at the single molecule level, from which we use the method to infer the duration of the promoter open complex formation. This is found to be possible even when adding external Gaussian noise to the RNA levels.
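The forward model underlying this approach can be sketched as a delayed stochastic simulation of a single gene. The sketch covers only the generation of RNA time series, not the Markov chain approximation or the inference step. The reaction scheme (initiation, a fixed delay standing for open complex formation, first-order degradation), the parameter values, and the function names are illustrative assumptions.

```python
import heapq
import random

def simulate_rna(k_init, tau_open, k_deg, t_end, rng):
    """Delayed stochastic simulation of one promoter.

    Initiation occurs at rate k_init while the promoter is free; the promoter is then
    occupied for tau_open seconds, after which one RNA is released and the promoter
    is freed. Each RNA degrades at rate k_deg. Returns (time, rna_count) at each change.
    """
    t, rna, promoter_free = 0.0, 0, True
    pending = []                      # min-heap of scheduled release times
    trace = [(0.0, 0)]
    while True:
        a_init = k_init if promoter_free else 0.0
        a_deg = k_deg * rna
        a_total = a_init + a_deg
        dt = rng.expovariate(a_total) if a_total > 0 else float("inf")
        next_release = pending[0] if pending else float("inf")
        if min(t + dt, next_release) > t_end:
            break
        if next_release <= t + dt:
            # A delayed release occurs first: jump to it, then redraw the next reaction.
            t = heapq.heappop(pending)
            rna += 1
            promoter_free = True
            trace.append((t, rna))
        else:
            t += dt
            if rng.random() * a_total < a_init:
                promoter_free = False
                heapq.heappush(pending, t + tau_open)
            else:
                rna -= 1
                trace.append((t, rna))
    return trace

def time_average(trace, t_end):
    """Time-weighted mean RNA count over [0, t_end]."""
    total = 0.0
    for i, (t0, n) in enumerate(trace):
        t1 = trace[i + 1][0] if i + 1 < len(trace) else t_end
        total += n * (t1 - t0)
    return total / t_end

# Hypothetical parameter values (per second, seconds).
K_INIT, TAU_OPEN, K_DEG, T_END = 0.01, 40.0, 0.002, 500000.0
trace = simulate_rna(K_INIT, TAU_OPEN, K_DEG, T_END, random.Random(1))
expected_mean = 1.0 / ((1.0 / K_INIT + TAU_OPEN) * K_DEG)
print("simulated mean RNA:", round(time_average(trace, T_END), 2),
      "expected:", round(expected_mean, 2))
```

In this scheme, consecutive RNA productions are separated by at least the delay, so the distribution of intervals in the simulated series carries information about the duration of the delayed step.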
A gene network's capacity to process information, so as to bind past events to future actions, depends on its structure and logic. From previous and new microarray measurements in Saccharomyces cerevisiae following gene deletions and overexpressions, we identify a core gene regulatory network (GRN) of functional interactions between 328 genes and the transfer functions of each gene. Inferred connections are verified by gene enrichment.
We find that this core network has a generalized clustering coefficient that is much higher than chance. The inferred Boolean transfer functions have a mean p-bias of 0.41, and thus similar amounts of activation and repression interactions. However, the distribution of p-biases differs significantly from what is expected by chance, which, along with the high mean connectivity, is found to cause the core GRN of S. cerevisiae to have an overall sensitivity similar to that of critical Boolean networks. In agreement, we find that the amount of information propagated between nodes in finite time series is much higher in the inferred core GRN of S. cerevisiae than what is expected by chance.
We suggest that S. cerevisiae is likely to have evolved a core GRN with enhanced information propagation among its genes.
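The two quantities used above, the p-bias and the sensitivity of a Boolean function, can be computed directly from a truth table. In the sketch below, the p-bias is taken to be the fraction of input combinations for which the output is 1, and the average sensitivity is the mean number of single-input flips that change the output. The formula 2Kp(1 - p) is the standard expectation for random Boolean functions with K inputs and bias p, for which a value of 1 marks the critical regime. The example functions and all names are illustrative assumptions, not the inferred yeast functions.

```python
def p_bias(truth_table):
    """Fraction of input combinations for which the function outputs 1."""
    return sum(truth_table) / len(truth_table)

def average_sensitivity(truth_table, k):
    """Mean number of single-input flips that change the output.

    truth_table[i] is the output for the input whose bits are the binary digits of i.
    """
    if len(truth_table) != 2 ** k:
        raise ValueError("truth table must have 2**k entries")
    changes = 0
    for x in range(2 ** k):
        for bit in range(k):
            if truth_table[x] != truth_table[x ^ (1 << bit)]:
                changes += 1
    return changes / 2 ** k

def expected_sensitivity(k, p):
    """Expected average sensitivity of a random Boolean function with k inputs and bias p."""
    return 2 * k * p * (1 - p)

AND2 = [0, 0, 0, 1]   # two-input AND
XOR2 = [0, 1, 1, 0]   # two-input XOR

for name, table in (("AND", AND2), ("XOR", XOR2)):
    print(name, "p-bias", p_bias(table), "sensitivity", average_sensitivity(table, 2))
print("random-function expectation for K=2, p=0.5:", expected_sensitivity(2, 0.5))
```

Averaging the sensitivities of all transfer functions in a network gives an overall sensitivity that can be compared with the critical value of 1.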
Molecular interaction networks establish all cell biological processes. The networks are under intensive research that is facilitated by new high-throughput measurement techniques for the detection, quantification, and characterization of molecules and their physical interactions. For the common model organism yeast Saccharomyces cerevisiae, public databases store a significant part of the accumulated information and, on the way to better understanding of the cellular processes, there is a need to integrate this information into a consistent reconstruction of the molecular interaction network. This work presents and validates RefRec, the most comprehensive molecular interaction network reconstruction currently available for yeast. The reconstruction integrates protein synthesis pathways, a metabolic network, and a protein-protein interaction network from major biological databases. The core of the reconstruction is based on a reference object approach in which genes, transcripts, and proteins are identified using their primary sequences. This enables their unambiguous identification and non-redundant integration. The obtained total number of different molecular species and their connecting interactions is ∼67,000. In order to demonstrate the capacity of RefRec for functional predictions, it was used for simulating the gene knockout damage propagation in the molecular interaction network in ∼590,000 experimentally validated mutant strains. Based on the simulation results, a statistical classifier was subsequently able to correctly predict the viability of most of the strains. The results also showed that the usage of different types of molecular species in the reconstruction is important for accurate phenotype prediction. In general, the findings demonstrate the benefits of global reconstructions of molecular interaction networks. 
With all the molecular species and their physical interactions explicitly modeled, our reconstruction is able to serve as a valuable resource in additional analyses involving objects from multiple molecular -omes. For that purpose, RefRec is freely available in the Systems Biology Markup Language format.
Several algorithms have been proposed for detecting fluorescently labeled subcellular objects in microscope images. Many of these algorithms have been designed for specific tasks and validated with limited image data. But despite the potential of using extensive comparisons between algorithms to provide useful information to guide method selection and thus more accurate results, relatively few studies have been performed.
To better understand algorithm performance under different conditions, we have carried out a comparative study including eleven spot detection or segmentation algorithms from various application fields. We used microscope images from well plate experiments with a human osteosarcoma cell line and frames from image stacks of yeast cells in different focal planes. These experimentally derived images permit a comparison of method performance in realistic situations where the number of objects varies within the image set. We also used simulated microscope images in order to compare the methods and validate them against a ground truth reference result. Our study finds major differences in the performance of different algorithms, in terms of both object counts and segmentation accuracies.
These results suggest that the selection of detection algorithms for image based screens should be done carefully and take into account different conditions, such as the possibility of acquiring empty images or images with very few spots. Our inclusion of methods that have not been used before in this context broadens the set of available detection methods and compares them against the current state-of-the-art methods for subcellular particle detection.
Stochasticity in gene expression affects many cellular processes and is a source of phenotypic diversity between genetically identical individuals. Events in elongation, particularly RNA polymerase pausing, are a source of this noise. Since the rate and duration of pausing are sequence-dependent, this regulatory mechanism of transcriptional dynamics is evolvable. The dependency of pause propensity on regulatory molecules makes pausing a response mechanism to external stress. Using a delayed stochastic model of bacterial transcription at the single nucleotide level that includes the promoter open complex formation, pausing, arrest, misincorporation and editing, pyrophosphorolysis, and premature termination, we investigate how RNA polymerase pausing affects a gene's transcriptional dynamics and gene networks. We show that pauses' duration and rate of occurrence affect the bursting in RNA production, transcriptional and translational noise, and the transient to reach mean RNA and protein levels. In a genetic repressilator, increasing the pausing rate and the duration of pausing events increases the period length but does not affect the robustness of the periodicity. We conclude that RNA polymerase pausing might be an important evolvable feature of genetic networks.
Investigation on how phenotypic diversity of genetically identical organisms is generated and regulated has focused on noise in gene expression. It is unknown to what extent noise in gene expression and genetic networks is evolvable, and by which mechanisms it evolves. The noise has several sources, e.g., noise in transcription initiation and during elongation. We focus on RNA polymerase (RNAP) pausing and show that it can regulate, to some extent, noise in gene expression. RNAP frequently pauses during elongation. The pausing frequency and average duration are sequence-specific, thus evolvable. The dependency of pause propensity on regulatory molecules makes pausing a mechanism adaptable to rapidly changing environments. We study, in a stochastic model of bacterial transcription at the single nucleotide level that includes the promoter open complex formation, pausing, arrest, misincorporation and editing, pyrophosphorolysis, and premature termination, how pausing affects the dynamics of gene expression and gene networks. In a model of a genetic clock, with periodic dynamics, pauses affect the period length but do not disrupt the periodicity. We conclude that RNAP pausing is an important evolvable feature of gene regulatory networks, that can be used by organisms to adapt to changing environments and regulate phenotypic diversity.
Fluorescence microscopy is the standard tool for detection and analysis of cellular phenomena. This technique, however, has a number of drawbacks such as the limited number of available fluorescent channels in microscopes, overlapping excitation and emission spectra of the stains, and phototoxicity.
We here present and validate a method to automatically detect cell population outlines directly from bright field images. By imaging samples with several focus levels forming a bright field z-stack, and by measuring the intensity variations of this stack over the z-dimension, we construct a new two-dimensional projection image of increased contrast. With additional information for locations of each cell, such as stained nuclei, this bright field projection image can be used instead of whole cell fluorescence to locate borders of individual cells, separate touching cells, and enable single cell analysis. Using the popular CellProfiler freeware cell image analysis software mainly targeted for fluorescence microscopy, we validate our method by automatically segmenting low contrast and rather complex shaped murine macrophage cells.
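The projection idea can be sketched in a few lines. The example below assumes that the intensity variation is measured as the per-pixel standard deviation across the focus (z) levels; this is one plausible choice and may differ from the measure used in the published Matlab code. The function name `zstack_projection` and the toy data are hypothetical.

```python
from statistics import pstdev


def zstack_projection(stack):
    """Project a bright field focus stack to a single 2-D image.

    `stack` is a list of 2-D images (lists of rows) of identical shape,
    one per focus level. Each output pixel is the population standard
    deviation of that pixel's intensity across the focus levels, so
    pixels whose intensity changes with focus (cells) become bright and
    pixels that stay constant (background) become dark.
    """
    if not stack:
        raise ValueError("stack must contain at least one image")
    rows, cols = len(stack[0]), len(stack[0][0])
    return [
        [pstdev([image[r][c] for image in stack]) for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    # Toy stack with three focus levels of a 2x2 image.
    # Only the top-right pixel changes with focus.
    stack = [
        [[10, 5], [10, 10]],
        [[10, 10], [10, 10]],
        [[10, 15], [10, 10]],
    ]
    for row in zstack_projection(stack):
        print([round(value, 3) for value in row])
```

The resulting projection can then be handed to a segmentation pipeline in place of a whole cell fluorescence image. For real image sizes an array library would be far faster than nested lists; the pure Python version is used here only to keep the sketch self-contained.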
The proposed approach frees up a fluorescence channel, which can be used for subcellular studies. It also facilitates cell shape measurement in experiments where whole cell fluorescent staining is either not available, or is dependent on a particular experimental condition. We show that whole cell area detection results obtained with our projected bright field images closely match those of the standard approach, in which cell areas are localized using fluorescence, and conclude that the high contrast bright field projection image can directly replace one fluorescent channel in whole cell quantification. Matlab code for calculating the projections can be downloaded from the supplementary site: http://sites.google.com/site/brightfieldorstaining
An important milestone in revealing cells' functions is to build a comprehensive understanding of transcriptional regulation processes. These processes are largely regulated by transcription factors (TFs) binding to DNA sites. Several TF binding site (TFBS) prediction methods have been developed, but they usually model the binding of a single TF at a time, although a few methods for predicting the binding of multiple TFs also exist. In this article, we propose a probabilistic model that predicts binding of several TFs simultaneously. Our method explicitly models the competitive binding between TFs and uses the prior knowledge of existing protein–protein interactions (PPIs), which mimics the situation in the nucleus. Modeling DNA binding for multiple TFs improves the accuracy of binding site prediction remarkably when compared with other programs and with the cases where individual binding prediction results of separate TFs have been combined. Traditional TFBS prediction methods usually predict an overwhelming number of false positives. Our competitive binding prediction method largely overcomes this lack of specificity. In addition, previously unpredictable binding sites can be detected with the help of PPIs. Source code is available at http://www.cs.tut.fi/~harrila/.
Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, among which mixture models provide a principled probabilistic framework. Cluster analysis is increasingly supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
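As a rough illustration of how such a model can be scored and compared, the sketch below evaluates the log-likelihood of a two-variable Beta-Gaussian mixture and the four criteria named above. It assumes one Gaussian variable and one beta variable per gene and assumes they are conditionally independent within a component; the paper's actual parameterization may differ. The criteria follow their standard definitions, with ICL-BIC taken as BIC plus twice the entropy of the posterior memberships. All function names and the component dictionary layout are hypothetical.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def beta_logpdf(y, a, b):
    """Log density of a beta(a, b) distribution for 0 < y < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y)


def log_likelihood(data, components):
    """Mixture log-likelihood of `data`, a list of (x, y) pairs.

    Each component is a dict with keys weight, mu, sigma, a, b.
    Within a component, x and y are assumed independent (an assumption
    of this sketch), so their log densities are added.
    """
    total = 0.0
    for x, y in data:
        terms = [
            math.log(c["weight"])
            + gaussian_logpdf(x, c["mu"], c["sigma"])
            + beta_logpdf(y, c["a"], c["b"])
            for c in components
        ]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def information_criteria(log_lik, n_params, n_obs, posteriors=None):
    """Return AIC, AIC3 and BIC, plus ICL-BIC if posteriors are given.

    `posteriors` is a list of rows, one per observation, holding the
    posterior membership probabilities of that observation.
    """
    base = -2.0 * log_lik
    criteria = {
        "AIC": base + 2 * n_params,
        "AIC3": base + 3 * n_params,
        "BIC": base + n_params * math.log(n_obs),
    }
    if posteriors is not None:
        entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
        criteria["ICL-BIC"] = criteria["BIC"] + 2 * entropy
    return criteria
```

For each candidate number of clusters, the model would be fitted (for example by EM), the criteria computed from the maximized log-likelihood, and the model with the smallest value of the chosen criterion retained.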
Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.