Restricted Boolean networks are simplified Boolean networks in which each regulation between genes is required to be either negative or positive. Higa et al. (BMC Proc 5:S5, 2011) proposed a three-rule algorithm to infer a restricted Boolean network from time-series data. However, the algorithm suffers from a major drawback: it is very sensitive to noise. In this paper, we systematically analyze the regulatory relationships between genes based on the state switch of the target gene and propose an algorithm with which restricted Boolean networks may be inferred from time-series data. We compare the proposed algorithm with the three-rule algorithm and the best-fit algorithm based on both synthetic networks and a well-studied budding yeast cell cycle network. The performance of the algorithms is evaluated by three distance metrics: the normalized-edge Hamming distance μ_ham^e, the normalized Hamming distance of state transition μ_ham^st, and the steady-state distribution distance μ_ssd. Results show that the proposed algorithm outperforms the others according to both μ_ham^e and μ_ham^st, whereas its performance according to μ_ssd is intermediate between the best-fit and three-rule algorithms. Thus, our new algorithm is more appropriate for inferring interactions between genes from time-series data.
doi:10.1186/s13637-014-0010-5
PMCID: PMC4107581
PMID: 25093019
Restricted Boolean network; Inference; Budding yeast cell cycle
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success in detecting coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence; the signature is then used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
doi:10.1186/1687-4153-2014-8
PMCID: PMC4077688
PMID: 24991213
Discrete Fourier transform; Sequence analysis; Sequence similarity
Copy number variations (CNVs) are abundant in the human genome. They have been associated with complex traits in genome-wide association studies (GWAS) and are expected to continue playing an important role in identifying the etiology of disease phenotypes. As a result of current high-throughput whole-genome single-nucleotide polymorphism (SNP) arrays, we now have datasets that simultaneously contain integer copy numbers in CNV regions and SNP genotypes. At the same time, haplotypes, which have been shown to offer advantages over genotypes in identifying disease traits, are available for SNP genotypes but are largely unavailable for CNV/SNP data due to insufficient computational tools. We introduce a new framework for inferring haplotypes in CNV/SNP data using a sequential Monte Carlo sampling scheme, ‘Tree-Based Deterministic Sampling CNV’ (TDSCNV). We compare our method with polyHap(v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets with varying numbers of markers. We have found that both algorithms show similar accuracy, but TDSCNV is an order of magnitude faster while scaling linearly with the number of markers and number of individuals, and thus could be the method of choice for haplotype inference in such datasets. Our method is implemented in the TDSCNV package, which is available for download at http://www.ee.columbia.edu/~anastas/tdscnv.
doi:10.1186/1687-4153-2014-7
PMCID: PMC4017783
PMID: 24868199
Perfect knowledge of the underlying state transition probabilities is necessary for designing an optimal intervention strategy for a given Markovian genetic regulatory network. However, in many practical situations, the complex nature of the network and/or identification costs limit the availability of such perfect knowledge. To address this difficulty, we propose to take a Bayesian approach and represent the system of interest as an uncertainty class of several models, each assigned some probability, which reflects our prior knowledge about the system. We define the objective function to be the expected cost relative to the probability distribution over the uncertainty class and formulate an optimal Bayesian robust intervention policy minimizing this cost function. The resulting policy may not be optimal for a fixed element within the uncertainty class, but it is optimal when averaged across the uncertainty class. Furthermore, starting from a prior probability distribution over the uncertainty class and collecting samples from the process over time, one can update the prior distribution to a posterior and find the corresponding optimal Bayesian robust policy relative to the posterior distribution. Therefore, the optimal intervention policy is essentially nonstationary and adaptive.
doi:10.1186/1687-4153-2014-6
PMCID: PMC3983901
PMID: 24708650
Optimal intervention; Markovian gene regulatory networks; Probabilistic Boolean networks; Uncertainty; Prior knowledge; Bayesian control
Parameter estimation in dynamic systems finds applications in various disciplines, including systems biology. The well-known expectation-maximization (EM) algorithm is a popular method and has been widely used to solve system identification and parameter estimation problems. However, the conventional EM algorithm cannot exploit sparsity, whereas in gene regulatory network inference problems the parameters to be estimated often exhibit sparse structure. In this paper, a regularized expectation-maximization (rEM) algorithm for sparse parameter estimation in nonlinear dynamic systems is proposed that is based on maximum a posteriori (MAP) estimation and can incorporate a sparse prior. The expectation step involves forward Gaussian approximation filtering and backward Gaussian approximation smoothing. The maximization step employs a re-weighted iterative thresholding method. The proposed algorithm is then applied to gene regulatory network inference. Results based on both synthetic and real data show the effectiveness of the proposed algorithm.
doi:10.1186/1687-4153-2014-5
PMCID: PMC3998071
PMID: 24708632
Nonlinear dynamic system; Parameter estimation; Sparsity; Expectation-maximization; Forward-backward recursion; Gaussian approximation; Gene regulatory network
The analysis of gene network robustness to noise and mutation is important for fundamental and practical reasons. Robustness refers to the stability of the equilibrium expression state of a gene network to variations of the initial expression state and network topology. Numerical simulation of these variations is commonly used for the assessment of robustness. Since there exists a great number of possible gene network topologies and initial states, even millions of simulations may still be too few to give reliable results. When the initial and equilibrium expression states are restricted to being saturated (i.e., their elements can only take values 1 or −1 corresponding to maximum activation and maximum repression of genes), an analytical gene network robustness assessment is possible. We present this analytical treatment based on determination of the saturated fixed point attractors for sigmoidal function models. The analysis can determine (a) for a given network, which and how many saturated equilibrium states exist and which and how many saturated initial states converge to each of these saturated equilibrium states and (b) for a given saturated equilibrium state or a given pair of saturated equilibrium and initial states, which and how many gene networks, referred to as viable, share this saturated equilibrium state or the pair of saturated equilibrium and initial states. We also show that the viable networks sharing a given saturated equilibrium state must follow certain patterns. These capabilities of the analytical treatment make it possible to properly define and accurately determine robustness to noise and mutation for gene networks. Previous network research conclusions drawn from performing millions of simulations follow directly from the results of our analytical treatment. Furthermore, the analytical results provide criteria for the identification of model validity and suggest modified models of gene network dynamics. The yeast cell-cycle network is used as an illustration of the practical application of this analytical treatment.
doi:10.1186/1687-4153-2014-4
PMCID: PMC3998189
PMID: 24650364
Robustness of gene networks; Fixed point attractor; Sigmoidal function; Yeast cell-cycle network
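As a rough illustration of point (a) in the abstract above, the following Python sketch enumerates the saturated fixed points of a small network by brute force. It assumes, as a simplification not stated in the abstract, that in the steep-sigmoid limit the update reduces to s(t+1) = sign(W s(t)); the function name `saturated_fixed_points` and the weight matrix `W` are hypothetical, and the paper's analytical treatment is not reproduced here.

```python
from itertools import product


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def saturated_fixed_points(W):
    """List all states s in {-1, 1}^n with sign(W s) == s.

    Assumption: a saturated state is an equilibrium when every gene's
    net input has the same sign as its current expression value.
    """
    n = len(W)
    fixed = []
    for s in product((-1, 1), repeat=n):
        nxt = tuple(sign(sum(W[i][j] * s[j] for j in range(n)))
                    for i in range(n))
        if nxt == s:
            fixed.append(s)
    return fixed


# Hypothetical two-gene network in which each gene activates the other.
W = [[0, 1],
     [1, 0]]
print(saturated_fixed_points(W))
```

The enumeration visits all 2^n saturated states, so it is only practical for small n; it is meant to make the definition concrete, not to replace the analytical results.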
It is widely accepted that cellular requirements and environmental conditions dictate the architecture of genetic regulatory networks. Nonetheless, the status quo in regulatory network modeling and analysis assumes an invariant network topology over time. In this paper, we refocus on a dynamic perspective of genetic networks, one that can uncover substantial topological changes in network structure during biological processes such as developmental growth. We propose a novel outlook on the inference of time-varying genetic networks, from a limited number of noisy observations, by formulating the network estimation as a target tracking problem. We overcome the limited number of observations (the small-n, large-p problem) by performing tracking in a compressed domain. Assuming linear dynamics, we derive the LASSO-Kalman smoother, which recursively computes the minimum mean-square sparse estimate of the network connectivity at each time point. The LASSO operator, motivated by the sparsity of genetic regulatory networks, allows simultaneous signal recovery and compression, thereby reducing the number of required observations. The smoothing improves the estimation by incorporating all observations. We track the time-varying networks during the life cycle of Drosophila melanogaster. The recovered networks show that few genes are permanent, whereas most are transient, acting only during specific developmental phases of the organism.
doi:10.1186/1687-4153-2014-3
PMCID: PMC3974129
PMID: 24517200
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors utilize the noise subspace concept for finding hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several techniques of DNA feature extraction that draw on various interdisciplinary fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions while completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.
doi:10.1186/1687-4153-2014-2
PMCID: PMC3895782
PMID: 24386895
Periodogram; Deoxyribonucleic acid; Least-norm solution; Eigenvector; Eigenvalue
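The 3-base periodicity mentioned in the abstract above can be illustrated with a plain DFT measure. The Python sketch below is not the least-norm estimator of the paper; it assumes binary indicator sequences for the four bases and evaluates the DFT only at frequency 1/3, and the function name `period3_power` is hypothetical.

```python
import cmath


def period3_power(seq):
    """Length-normalised spectral power at frequency 1/3.

    For each base, a binary indicator sequence (1 where the base
    occurs, 0 elsewhere) is projected onto exp(-2*pi*i*k/3); the
    squared magnitudes are summed over the four bases.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * k / 3)
                    for k, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / len(seq)


# A perfectly codon-periodic toy sequence versus a period-2 one.
print(period3_power("ATG" * 30))
print(period3_power("AC" * 45))
```

In this toy comparison the codon-periodic sequence gives a large value and the period-2 sequence gives a value near zero. A sliding-window version of the same measure would correspond to the periodogram-style baseline; the window length would be a further assumption.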
In this paper, we first present a new concept of ‘weight’ for the 64 triplets and define a different weight for each kind of triplet. Then, we give a novel 2D graphical representation for DNA sequences, which can transform a DNA sequence into a plot set to facilitate quantitative comparisons of DNA sequences. Thereafter, together with a newly designed measure of similarity, we introduce a novel approach to the similarity/dissimilarity analysis of DNA sequences. Finally, an application to the similarity/dissimilarity analysis of the complete coding sequences of the β-globin genes of 11 species illustrates the utility of our newly proposed method.
doi:10.1186/1687-4153-2014-1
PMCID: PMC3896961
PMID: 24383852
Graphical representation; Similarities/dissimilarities analysis; Triplet; DNA sequence
The extended Kalman filter (EKF) has been applied to inferring gene regulatory networks. However, it is well known that the EKF becomes less accurate when the system exhibits high nonlinearity. In addition, certain prior information about the gene regulatory network exists in practice, and no systematic approach has been developed to incorporate such prior information into the Kalman-type filter for inferring the structure of the gene regulatory network. In this paper, an inference framework based on point-based Gaussian approximation filters that can exploit the prior information is developed to solve the gene regulatory network inference problem. Different point-based Gaussian approximation filters, including the unscented Kalman filter (UKF), the third-degree cubature Kalman filter (CKF3), and the fifth-degree cubature Kalman filter (CKF5), are employed. Several types of network prior information, including the existing network structure information, sparsity assumption, and the range constraint of parameters, are considered, and the corresponding filters incorporating the prior information are developed. Experiments on a synthetic network of eight genes and the yeast protein synthesis network of five genes are carried out to demonstrate the performance of the proposed framework. The results show that the proposed methods provide more accurate inference results than existing methods, such as the EKF and the traditional UKF.
doi:10.1186/1687-4153-2013-16
PMCID: PMC3977693
PMID: 24341668
Gene regulatory network; Point-based Gaussian approximation filters; Network prior information; Sparsity; Iterative thresholding
Hannah Arendt, one of the foremost political philosophers of the twentieth century, has argued that it is the responsibility of educators not to leave children in their own world but instead to bring them into the adult world so that, as adults, they can carry civilization forward to whatever challenges it will face by bringing to bear the learning of the past. In the same collection of essays, she discusses the recognition by modern science that Nature is inconceivable in terms of ordinary human conceptual categories - as she writes, ‘unthinkable in terms of pure reason’. Together, these views on scientific education lead to an educational process that transforms children into adults, with a scientific adult being one who has the ability to conceptualize scientific systems independent of ordinary physical intuition. This article begins with Arendt’s basic educational and scientific points and develops from them a critique of current scientific education in conjunction with an appeal to educate young scientists in a manner that allows them to fulfill their potential ‘on the shoulders of giants’. While the article takes a general philosophical perspective, its specifics tend to be directed at biomedical education, in particular, how such education pertains to translational science.
doi:10.1186/1687-4153-2013-15
PMCID: PMC3826847
PMID: 24215841
Copy number alterations (CNAs) can be observed in most cancer patients. Several oncogenes and tumor suppressor genes with CNAs have been identified in different kinds of tumors. However, a systematic survey of CNA-affected functions is still lacking. By employing systems biology approaches, instead of examining individual genes, we directly identified the functional hotspots on the human genome. A total of 838 hotspots on the human genome with 540 enriched Gene Ontology functions were identified. Seventy-six aCGH array datasets of hepatocellular carcinoma (HCC) tumors were employed in this study. A total of 150 regions putatively affected by CNAs, together with their encoded functions, were identified. Our results indicate that two immune-related hotspots had copy number alterations in most patients. In addition, our data imply that these immune-related regions might be involved in HCC oncogenesis. We also identified 39 hotspots whose copy number status was associated with patient survival. Our data imply that copy number alterations of these regions may contribute to the dysregulation of the encoded functions. These results further demonstrate that our method enables researchers to survey the biological functions of CNAs and to construct regulation hypotheses at the pathway and functional levels.
doi:10.1186/1687-4153-2013-14
PMCID: PMC3833309
PMID: 24160471
Copy number alteration; Gene set enrichment; Pathway analysis; Liver cancer
We propose methods to integrate data across several genomic platforms using a hierarchical Bayesian analysis framework that incorporates the biological relationships among the platforms to identify genes whose expression is related to clinical outcomes in cancer. This integrated approach combines information across all platforms, leading to increased statistical power in finding these predictive genes, and further provides mechanistic information about the manner in which the gene affects the outcome. We demonstrate the advantages of the shrinkage estimation used by this approach through a simulation, and finally, we apply our method to a Glioblastoma Multiforme dataset and identify several genes potentially associated with the patients’ survival. We find 12 positive prognostic markers associated with nine genes and 13 negative prognostic markers associated with nine genes.
doi:10.1186/1687-4153-2013-13
PMCID: PMC3849593
PMID: 24053265
Bayesian modeling; Genomics; Hierarchical models; Integrative analysis; Shrinkage priors
Interaction among different risk factors plays an important role in the development and progression of complex diseases, such as diabetes. However, traditional epidemiological methods often focus on analyzing individual or a few ‘essential’ risk factors in the hope of obtaining some insights into the etiology of complex disease. In this paper, we propose a systematic framework for risk factor analysis based on a synergy network, which enables better identification of potential risk factors that may serve as prognostic markers for complex disease. A spectral approximate algorithm is derived to solve this network optimization problem, which leads to a new network-based feature ranking method that improves the traditional feature ranking by taking into account the pairwise synergistic interactions among risk factors in addition to their individual predictive power. We first evaluate the performance of our method based on simulated datasets, and then, we use our method to study immunologic and metabolic indices based on the Diabetes Prevention Trial-Type 1 (DPT-1) study that may provide prognostic and diagnostic information regarding the development of type 1 diabetes. The performance comparison based on both simulated and DPT-1 datasets demonstrates that our network-based ranking method provides prognostic markers with higher predictive power than traditional analysis based on individual factors.
doi:10.1186/1687-4153-2013-12
PMCID: PMC3849336
PMID: 24050757
DPT-1; Type 1 diabetes; Biomarker identification; Interaction; Synergy network; Feature ranking
Networks of molecular interactions regulate key processes in living cells. Therefore, understanding their functionality is a high priority in advancing biological knowledge. Boolean networks are often used to describe cellular networks mathematically and are fitted to experimental datasets. The fitting often results in ambiguities since the interpretation of the measurements is not straightforward and since the data contain noise. In order to facilitate a more reliable mapping between datasets and Boolean networks, we develop an algorithm that infers network trajectories from a dataset distorted by noise. We analyze our algorithm theoretically and demonstrate its accuracy using simulation and microarray expression data.
doi:10.1186/1687-4153-2013-11
PMCID: PMC3850440
PMID: 24006954
Boolean network; Inference; Conditional entropy; Gradient descent
A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from this data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers, the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims.
Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
doi:10.1186/1687-4153-2013-10
PMCID: PMC3765562
PMID: 23958425
Background
Recent advances in genome technologies and the subsequent collection of genomic information at various molecular resolutions hold promise to accelerate the discovery of new therapeutic targets. A critical step in achieving these goals is to develop efficient clinical prediction models that integrate these diverse sources of high-throughput data. This step is challenging due to the presence of high-dimensionality and complex interactions in the data. For predicting relevant clinical outcomes, we propose a flexible statistical machine learning approach that acknowledges and models the interaction between platform-specific measurements through nonlinear kernel machines and borrows information within and between platforms through a hierarchical Bayesian framework. Our model has parameters with direct interpretations in terms of the effects of platforms and data interactions within and across platforms. The parameter estimation algorithm in our model uses a computationally efficient variational Bayes approach that scales well to large high-throughput datasets.
Results
We apply our methods of integrating gene/mRNA expression and microRNA profiles for predicting patient survival times to a glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas (TCGA). In terms of prediction accuracy, we show that our non-linear and interaction-based integrative methods perform better than linear alternatives and non-integrative methods that do not account for interactions between the platforms. We also find several prognostic mRNAs and microRNAs that are related to tumor invasion and are known to drive tumor metastasis and severe inflammatory response in GBM. In addition, our analysis reveals several interesting mRNA and microRNA interactions that have known implications in the etiology of GBM.
Conclusions
Our approach gains its flexibility and power by modeling the non-linear interaction structures between and within the platforms. Our framework is a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers. Our software is freely available at: http://odin.mdacc.tmc.edu/~vbaladan.
doi:10.1186/1687-4153-2013-9
PMCID: PMC3726335
PMID: 23809014
Bayesian modeling; Multiple kernel learning; Genomics; High-dimensional data analysis; Prediction; Variational inference
DNA methylation plays an important role in many biological processes by regulating gene expression. It is commonly accepted that turning on DNA methylation leads to silencing of the expression of the corresponding genes. While methylation is often described as a binary on-off signal, it is typically measured using beta values derived from either microarray or sequencing technologies, which take continuous values between 0 and 1. If we would like to interpret methylation in a binary fashion, appropriate thresholds are needed to dichotomize the continuous measurements. In this paper, we use data from The Cancer Genome Atlas project. For a total of 992 samples across five cancer types, both methylation and gene expression data are available. A bivariate extension of the StepMiner algorithm is used to identify thresholds for dichotomizing both methylation and expression data. A hypergeometric test is applied to identify CpG sites whose methylation status is significantly associated with silencing of the expression of their corresponding genes. The test is performed either on all five cancer types together or on individual cancer types separately. We notice that the appropriate thresholds vary across different CpG sites. In addition, the negative association between methylation and expression is highly tissue specific.
doi:10.1186/1687-4153-2013-8
PMCID: PMC3680080
PMID: 23742247
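The hypergeometric test mentioned in the abstract above can be sketched in a few lines of Python. The mapping of counts to arguments is an assumption for illustration only: `N` samples in total, `K` of them called methylated after dichotomization, `n` of them called silenced, and `k` both methylated and silenced. The function name `hypergeom_upper_tail` is hypothetical, and the StepMiner thresholding step is not shown.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: population size, K: number of 'success' items in the
    population, n: number of draws, k: observed overlap (k >= 0).
    """
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Toy counts: 10 samples, 5 methylated, 5 silenced, all 5 overlapping.
p_value = hypergeom_upper_tail(5, 10, 5, 5)
print(p_value)
```

A small upper-tail probability indicates that the overlap between methylated and silenced samples is larger than expected by chance. When many CpG sites are tested, a multiple-testing correction would normally be applied as well; that step is outside this sketch.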
Paul Dan Cristea, professor of Electrical Engineering and Computer Science at ‘Politehnica’ University of Bucharest died on 17 April 2013, following several years of bravely battling a perfidious illness.
doi:10.1186/1687-4153-2013-7
PMCID: PMC3656793
PMID: 23663854
Consider a large Boolean network with a feed forward structure. Given a probability distribution on the inputs, can one find, possibly small, collections of input nodes that determine the states of most other nodes in the network? To answer this question, a notion that quantifies the determinative power of an input over the states of the nodes in the network is needed. We argue that the mutual information (MI) between a given subset of the inputs X={X1,...,Xn} of some node i and its associated function fi(X) quantifies the determinative power of this set of inputs over node i. We compare the determinative power of a set of inputs to the sensitivity to perturbations to these inputs, and find that, maybe surprisingly, an input that has large sensitivity to perturbations does not necessarily have large determinative power. However, for unate functions, which play an important role in genetic regulatory networks, we find a direct relation between MI and sensitivity to perturbations. As an application of our results, we analyze the large-scale regulatory network of Escherichia coli. We identify the most determinative nodes and show that a small subset of those reduces the overall uncertainty of the network state significantly. Furthermore, the network is found to be tolerant to perturbations of its inputs.
doi:10.1186/1687-4153-2013-6
PMCID: PMC3748841
PMID: 23642003
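The contrast drawn in the abstract above between determinative power and sensitivity can be checked on a toy function. The Python sketch below computes the mutual information between a subset of inputs and the output of a Boolean function, assuming independent, uniformly distributed inputs (the abstract allows a general input distribution); the function name `mutual_information` is hypothetical.

```python
from itertools import product
from math import log2


def mutual_information(f, n, subset):
    """I(X_A; f(X)) in bits for n independent uniform binary inputs.

    f maps a tuple of n bits to 0 or 1; subset lists the indices
    of the inputs that form X_A.
    """
    p = 1.0 / 2 ** n
    joint, marg_a, marg_y = {}, {}, {}
    for x in product((0, 1), repeat=n):
        a = tuple(x[i] for i in subset)
        y = f(x)
        joint[(a, y)] = joint.get((a, y), 0.0) + p
        marg_a[a] = marg_a.get(a, 0.0) + p
        marg_y[y] = marg_y.get(y, 0.0) + p
    return sum(pj * log2(pj / (marg_a[a] * marg_y[y]))
               for (a, y), pj in joint.items())


def xor(x):
    return x[0] ^ x[1]


print(mutual_information(xor, 2, (0,)))    # 0.0
print(mutual_information(xor, 2, (0, 1)))  # 1.0
```

For XOR, flipping either input always flips the output, so each input is maximally sensitive, yet a single input on its own carries no information about the output. This is one concrete instance of the observation that large sensitivity does not imply large determinative power.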
Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on hierarchical Dirichlet processes (HDP). HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary clustering fragments.
doi:10.1186/1687-4153-2013-5
PMCID: PMC3656798
PMID: 23587447
One of the major challenges in complex systems biology is that of providing a general theoretical framework to describe the phenomena involved in cell differentiation, i.e., the process whereby stem cells, which can develop into different types, become progressively more specialized. The aim of this study is to briefly review a dynamical model of cell differentiation which is able to cover a broad spectrum of experimentally observed phenomena and to present some novel results.
doi:10.1186/1687-4153-2013-4
PMCID: PMC3599082
PMID: 23421492
Cell differentiation; Dynamical model; Boolean networks; Ergodic sets
Background
There is a growing body of evidence associating microRNAs (miRNAs) with human diseases. MiRNAs are new key players in the disease paradigm, demonstrating roles in several human diseases. The functional association between miRNAs and diseases remains largely unclear and far from complete. With the advent of high-throughput functional genomics techniques that infer genes and biological pathways dysregulated in diseases, it is now possible to infer functional associations between diseases and biological molecules by integrating disparate biological information.
Results
Here, we first used a Lasso regression model to identify miRNAs associated with a disease signature as a proof of concept. We then proposed an integrated approach that uses disease-gene associations from microarray experiments and text mining, and miRNA-gene associations from computational predictions and protein networks, to build a functional association network between miRNAs and diseases. The findings of the proposed model were validated against gold standard datasets using ROC analysis, and the results were promising (AUC=0.81). Our protein network-based approach discovered 19 new functional associations between prostate cancer and miRNAs. The 19 new associations were validated using miRNA expression data and clinical profiles and were shown to act as diagnostic and prognostic prostate biomarkers. The proposed integrated approach allowed us to reconstruct functional associations between miRNAs and human diseases and uncovered functional roles of newly discovered miRNAs.
Conclusions
Lasso regression was used to find associations between diseases and miRNAs using their gene signature. Defining miRNA gene signature by integrating the downstream effect of miRNAs demonstrated better performance than the miRNA signature alone. Integrating biological networks and multiple data to define miRNA and disease gene signature demonstrated high performance to uncover new functional associations between miRNAs and diseases.
doi:10.1186/1687-4153-2013-3
PMCID: PMC3606436
PMID: 23339438
miRNA; Protein interactions; Systems biology; Disease; Regression modeling
In clinical practice, many diseases such as glioblastoma, leukemia, diabetes, and prostate cancer have multiple subtypes. Classifying subtypes accurately using genomic data will make it possible to provide individualized treatments that target specific disease subtypes. However, it is often difficult to obtain satisfactory classification accuracy using only one type of data, because the subtypes of a disease can exhibit similar patterns in one data type. Fortunately, multiple types of genomic data are often available due to the rapid development of genomic techniques. This raises the question of whether the classification performance can be significantly improved by combining multiple types of genomic data. In this article, we classified four subtypes of glioblastoma multiforme (GBM) with multiple types of genome-wide data (e.g., mRNA and miRNA expression) from The Cancer Genome Atlas (TCGA) project. We proposed a multi-class compressed sensing-based detector (MCSD) for this study. The MCSD was trained with data from TCGA and then applied to subtype GBM patients using an independent test dataset. We performed the classification on the same patient subjects with three data types, i.e., miRNA expression data, mRNA (or gene expression) data, and their combination. The classification accuracy is 69.1% with the miRNA expression data, 52.7% with mRNA expression data, and 90.9% with the combination of both mRNA and miRNA expression data. In addition, some biomarkers identified by the integrated approaches have been confirmed by results from the published literature. These results indicate that the combined analysis can significantly improve the accuracy of classifying GBM subtypes and identify potential biomarkers for disease diagnosis.
doi:10.1186/1687-4153-2013-2
PMCID: PMC3651309
PMID: 23311594
Glioblastoma; Data integration; Compressed sensing; Classification; mRNA; miRNA
Transcriptional regulation networks are often modeled as Boolean networks. We discuss certain properties of Boolean functions (BFs) that are considered important in such networks, namely, membership in the classes of unate or canalizing functions. Of further interest is the average sensitivity (AS) of functions. In this article, we discuss several algorithms to test the properties of interest. To test canalizing properties of functions, we apply spectral techniques, which can also be used to characterize the AS of functions as well as the influences of variables in unate BFs. Further, we provide and review upper and lower bounds on the AS of unate BFs based on the spectral representation. Finally, we apply these methods to a transcriptional regulation network of Escherichia coli, which controls central parts of the E. coli metabolism. We find that all functions are unate. In addition, the analysis of the AS of the network reveals an exceptional robustness against transient fluctuations of the binary variables.
doi:10.1186/1687-4153-2013-1
PMCID: PMC3605186
PMID: 23311536
Regulatory Boolean networks; Boolean networks; Linear threshold functions; Unate functions; Canalizing function; Sensitivity; Average sensitivity; Restricted functions; Escherichia coli
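To make the properties named in the abstract above concrete, the following Python sketch tests unateness and computes the average sensitivity of a small Boolean function by direct enumeration of its truth table. It assumes uniformly distributed inputs and does not use the spectral techniques discussed in the article; the names `influences`, `average_sensitivity`, and `is_unate` are hypothetical.

```python
from itertools import product


def flip(x, i):
    """Return a copy of the bit tuple x with bit i inverted."""
    return x[:i] + (1 - x[i],) + x[i + 1:]


def influences(f, n):
    """Influence of each variable: fraction of inputs on which
    flipping that variable changes the value of f."""
    total = 2 ** n
    return [sum(f(x) != f(flip(x, i)) for x in product((0, 1), repeat=n)) / total
            for i in range(n)]


def average_sensitivity(f, n):
    """Average sensitivity under uniform inputs: sum of influences."""
    return sum(influences(f, n))


def is_unate(f, n):
    """True if f is monotone (increasing or decreasing) in each
    variable separately."""
    for i in range(n):
        increasing = decreasing = False
        for x in product((0, 1), repeat=n):
            if x[i] == 0:
                delta = int(f(flip(x, i))) - int(f(x))
                increasing = increasing or delta > 0
                decreasing = decreasing or delta < 0
        if increasing and decreasing:
            return False
    return True


def and_not(x):
    """x0 AND (NOT x1): activating in x0, repressing in x1."""
    return int(x[0] == 1 and x[1] == 0)


def xor(x):
    return x[0] ^ x[1]


print(is_unate(and_not, 2), average_sensitivity(and_not, 2))
print(is_unate(xor, 2), average_sensitivity(xor, 2))
```

The enumeration costs on the order of n·2^n function evaluations, which is acceptable for the low in-degrees typical of regulatory functions but not for large n.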