In the past 15 years, new “omics” technologies have made it possible to obtain high-resolution molecular snapshots of organisms, tissues, and even individual cells at various disease states and experimental conditions. It is hoped that these developments will usher in a new era of personalized medicine in which an individual’s molecular measurements are used to diagnose disease, guide therapy, and perform other tasks more accurately and effectively than is possible using standard approaches. There now exists a vast literature of reported “molecular signatures”. However, despite some notable exceptions, many of these signatures have suffered from limited reproducibility in independent datasets, insufficient sensitivity or specificity to meet clinical needs, or other challenges. In this paper, we discuss the process of molecular signature discovery on the basis of omics data. In particular, we highlight potential pitfalls in the discovery process, as well as strategies that can be used to increase the odds of successful discovery. Despite the difficulties that have plagued the field of molecular signature discovery, we remain optimistic about the potential to harness the vast amounts of available omics data in order to substantially impact clinical practice.
Diagnostics; Disease classification; Systems biology; Translational bioinformatics
Characterizing interactions between drugs is important to avoid potentially harmful combinations, to reduce off-target effects of treatments and to fight antibiotic resistant pathogens, among others. Here we present a network inference algorithm to predict uncharacterized drug-drug interactions. Our algorithm takes, as its only input, sets of previously reported interactions, and does not require any pharmacological or biochemical information about the drugs, their targets or their mechanisms of action. Because the models we use are abstract, our approach can deal with adverse interactions, synergistic/antagonistic/suppressing interactions, or any other type of drug interaction. We show that our method is able to accurately predict interactions, both in exhaustive pairwise interaction data between small sets of drugs, and in large-scale databases. We also demonstrate that our algorithm can be used efficiently to discover interactions of new drugs as part of the drug discovery process.
Over one in four adults older than 57 in the US take five or more prescriptions at the same time; as many as 4% are at risk of a major adverse drug-drug interaction. Potentially beneficial effects of drug combinations, on the other hand, are also important. For example, combinations of drugs with synergistic effects increase the efficacy of treatments and reduce side effects; and suppressing interactions between drugs, in which one drug inhibits the action of the other, have been found to be effective in the fight against antibiotic-resistant pathogens. With thousands of drugs on the market, and hundreds or thousands being tested and developed, it is clear that we cannot rely only on experimental assays, or even mechanistic pharmacological models, to uncover new interactions. Here we present an algorithm that is able to predict such interactions. Our algorithm is parameter-free, unsupervised, and takes, as its only input, sets of previously reported interactions. We show that our method is able to accurately predict interactions, even in large-scale databases containing thousands of drugs, and that it can be used efficiently to discover interactions of new drugs as part of the drug discovery process.
There is a strong need for computational frameworks that integrate different biological processes and data-types to unravel cellular regulation. Current efforts to reconstruct transcriptional regulatory networks (TRNs) focus primarily on proximal data such as gene co-expression and transcription factor (TF) binding. While such approaches enable rapid reconstruction of TRNs, the overwhelming combinatorics of possible networks limits identification of mechanistic regulatory interactions. Utilizing growth phenotypes and systems-level constraints to inform regulatory network reconstruction is an unmet challenge. We present our approach, Gene Expression and Metabolism Integrated for Network Inference (GEMINI), which links a compendium of candidate regulatory interactions with the metabolic network to predict their systems-level effect on growth phenotypes. We then compare predictions with experimental phenotype data to select phenotype-consistent regulatory interactions. GEMINI makes use of the observation that only a small fraction of regulatory network states are compatible with a viable metabolic network, and outputs a regulatory network that is simultaneously consistent with the input genome-scale metabolic network model, gene expression data, and TF knockout phenotypes. GEMINI preferentially recalls gold-standard interactions (p-value = 10^−172), significantly better than using gene expression alone. We applied GEMINI to create an integrated metabolic-regulatory network model for Saccharomyces cerevisiae involving 25,000 regulatory interactions controlling 1597 metabolic reactions. The model quantitatively predicts TF knockout phenotypes in new conditions (p-value = 10^−14) and revealed potential condition-specific regulatory mechanisms.
Our results suggest that a metabolic constraint-based approach can be successfully used to help reconstruct TRNs from high-throughput data, and highlight the potential of using a biochemically-detailed mechanistic framework to integrate and reconcile inconsistencies across different data-types. The algorithm and associated data are available at https://sourceforge.net/projects/gemini-data/
Cellular networks, such as metabolic and transcriptional regulatory networks (TRNs), do not operate independently but work together to determine cellular phenotypes. Further, the phenotype and architecture of one network constrain the topology of other networks. Hence, it is critical to study network components and interactions in the context of the entire cell. Typically, efforts to reconstruct TRNs focus only on immediately proximal data such as gene co-expression and transcription factor (TF)-binding. Herein, we take a different strategy by linking candidate TRNs with the metabolic network to predict systems-level responses such as growth phenotypes of TF knockout strains, and compare predictions with experimental phenotype data to select amongst the candidate TRNs. Our approach goes beyond traditional data integration approaches for network inference and refinement by using a predictive network model (metabolism) to refine another network model (regulation), thus providing an alternative avenue to this area of research. Understanding how the networks function together in a cell will pave the way for synthetic biology and has a wide range of applications in biotechnology, drug discovery and diagnostics. Further, we demonstrate how metabolic models can integrate and reconcile inconsistencies across different data-types.
The transcriptional networks that regulate gene expression, and modifications to these networks, are at the core of the cancer phenotype. MicroRNAs, a well-studied species of small non-coding RNA molecules, have been shown to have a central role in regulating gene expression as part of this transcriptional network. Further, microRNA deregulation is associated with cancer development and with tumor progression. Glioblastoma Multiforme (GBM) is the most common, aggressive and malignant primary tumor of the brain and is associated with one of the worst 5-year survival rates among all human cancers. To study the transcriptional network and its modifications in GBM, we utilized gene expression, microRNA sequencing, whole genome sequencing and clinical data from hundreds of patients from different datasets. Using these data and a novel microRNA-gene association approach, we identified distinctive microRNAs and their associated genes. What makes these pairs distinctive is the quantifiable association between microRNA and gene expression levels, which we show stratifies patients into clinical subgroups with high statistical significance. Importantly, this stratification goes unobserved by other methods and is not associated with other subsets or phenotypes within the data. To investigate the robustness of the approach, we show that the findings hold in unrelated datasets. Among the set of identified microRNA-gene associations, we closely study the example of MAF and hsa-miR-330-3p, and show how their co-behavior stratifies patients into prognostic clinical groups and how whole genome sequences point to a specific genomic variation as a possible basis for differences among patients. We argue that these identified associations may indicate previously unexplored specific disease control mechanisms and may be used as a basis for further study and for possible therapeutic intervention.
Despite major progress and improved understanding of Glioblastoma Multiforme, the disease is still associated with poor prognosis. The identification of genomic regulatory mechanisms, their association with clinical outcome, and the link between specific modifications in genome sequence and the gain or loss of such regulatory activity combine to suggest specific disease mechanisms and possible means of intervention in the course of the disease. We report here a method, and its implementation, for exposing possible regulatory mechanisms in GBM. At the core of this method is the use of associations between microRNAs and genes as a quantifiable metric. Identification of these associations and their relationship with clinical features, combined with the availability of whole genome sequences, brings forward specific microRNAs and their associated genes. Associating specific genomic sequences with clinical outcome thus translates personal genomics into tumor-relevant decision-making.
Astrocytoma is the most common glioma, accounting for half of all primary brain and spinal cord tumors. Late detection and the aggressive nature of high-grade astrocytomas contribute to high mortality rates. Though many studies identify candidate biomarkers using high-throughput transcriptomic profiling to stratify grades and subtypes, few have resulted in clinically actionable results. This shortcoming can be attributed, in part, to pronounced lab effects that reduce signature robustness and varied individual gene expression among patients with the same tumor. We addressed these issues by uniformly preprocessing publicly available transcriptomic data, comprising 306 tumor samples from three astrocytoma grades (Grade 2, 3, and 4) and 30 non-tumor samples (normal brain as control tissues). Utilizing Differential Rank Conservation (DIRAC), a network-based classification approach, we examined the global and individual patterns of network regulation across tumor grades. Additionally, we applied gene-based approaches to identify genes whose expression changed consistently with increasing tumor grade and evaluated their robustness across multiple studies using statistical sampling. Applying DIRAC, we observed a global trend of greater network dysregulation with increasing tumor aggressiveness. Individual networks displaying greater differences in regulation between adjacent grades play well-known roles in calcium/PKC, EGF, and transcription signaling. Interestingly, many of the 90 individual genes found to monotonically increase or decrease with astrocytoma grade are implicated in cancer-affected processes such as calcium signaling, mitochondrial metabolism, and apoptosis. The fact that specific genes monotonically increase or decrease with increasing astrocytoma grade may reflect shared oncogenic mechanisms among phenotypically similar tumors. 
This work presents statistically significant results that enable better characterization of different human astrocytoma grades and hopefully can contribute towards improvements in diagnosis and therapy choices. Our results also identify a number of testable hypotheses relating to astrocytoma etiology that may prove helpful in developing much-needed biomarkers for earlier disease detection.
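The rank-based idea behind DIRAC can be illustrated with a minimal sketch. It assumes each sample is a mapping from gene name to expression value and that a "network" is simply a list of gene names. The function names and toy values below are hypothetical, and the code is a simplified illustration rather than the published implementation.

```python
from itertools import combinations

def pairwise_orderings(sample, genes):
    """For each gene pair (a, b) in the network, record whether expr[a] < expr[b]."""
    return tuple(sample[a] < sample[b] for a, b in combinations(genes, 2))

def rank_template(samples, genes):
    """Majority-vote ordering of each gene pair within one phenotype."""
    orderings = [pairwise_orderings(s, genes) for s in samples]
    n = len(orderings)
    return tuple(sum(column) > n / 2 for column in zip(*orderings))

def rank_matching_score(sample, genes, template):
    """Fraction of gene pairs in a sample whose ordering agrees with the template."""
    observed = pairwise_orderings(sample, genes)
    return sum(o == t for o, t in zip(observed, template)) / len(template)

def rank_conservation_index(samples, genes):
    """Mean matching score of a phenotype's samples against its own template."""
    template = rank_template(samples, genes)
    scores = [rank_matching_score(s, genes, template) for s in samples]
    return sum(scores) / len(scores)

# Hypothetical toy data (not from the study): three genes in one small network.
genes = ["g1", "g2", "g3"]
low_grade = [
    {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    {"g1": 1.2, "g2": 2.5, "g3": 3.1},
    {"g1": 0.9, "g2": 2.2, "g3": 2.8},
]
high_grade = [
    {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    {"g1": 3.0, "g2": 2.0, "g3": 1.0},
    {"g1": 2.0, "g2": 3.0, "g3": 1.0},
]

low_index = rank_conservation_index(low_grade, genes)
high_index = rank_conservation_index(high_grade, genes)
print(low_index, high_index)
```

In this toy case the low-grade samples share one gene ordering, so their index is 1.0, while the high-grade samples disagree with one another and score lower. A lower index is read here as looser regulation of the network within that phenotype.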
Dysfunction in energy metabolism—including in pathways localized to the mitochondria—has been implicated in the pathogenesis of a wide array of disorders, ranging from cancer to neurodegenerative diseases to type II diabetes. The inherent complexities of energy and mitochondrial metabolism present a significant obstacle in the effort to understand the role that these molecular processes play in the development of disease. To help unravel these complexities, systems biology methods have been applied to develop an array of computational metabolic models, ranging from mitochondria-specific processes to genome-scale cellular networks. These constraint-based (CB) models can efficiently simulate aspects of normal and aberrant metabolism in various genetic and environmental conditions. Development of these models leverages—and also provides a powerful means to integrate and interpret—information from a wide range of sources including genomics, proteomics, metabolomics, and enzyme kinetics. Here, we review a variety of mechanistic modeling studies that explore metabolic functions, deficiency disorders, and aberrant biochemical pathways in mitochondria and related regions in the cell.
mitochondria; energy metabolism; constraint-based models; flux balance analysis; Warburg effect; systems biology
Methanosarcina acetivorans strain C2A is a marine methanogenic archaeon notable for its substrate utilization, genetic tractability, and novel energy conservation mechanisms. To help probe the phenotypic implications of this organism's unique metabolism, we have constructed and manually curated a genome-scale metabolic model of M. acetivorans, iMB745, which accounts for 745 of the 4,540 predicted protein-coding genes (16%) in the M. acetivorans genome. The reconstruction effort has identified key knowledge gaps and differences in peripheral and central metabolism between methanogenic species. Using flux balance analysis, the model quantitatively predicts wild-type phenotypes and is 96% accurate in knockout lethality predictions compared to currently available experimental data. The model was used to probe the mechanisms and energetics of by-product formation and growth on carbon monoxide, as well as the nature of the reaction catalyzed by the soluble heterodisulfide reductase HdrABC in M. acetivorans. The genome-scale model provides quantitative and qualitative hypotheses that can be used to help iteratively guide additional experiments to further the state of knowledge about methanogenesis.
Updates to maintain a state-of-the-art reconstruction of the yeast metabolic network are essential to reflect our understanding of yeast metabolism and functional organization, to eliminate any inaccuracies identified in earlier iterations, to improve predictive accuracy and to continue to expand into novel subsystems to extend the comprehensiveness of the model. Here, we present version 6 of the consensus yeast metabolic network (Yeast 6) as an update to the community effort to computationally reconstruct the genome-scale metabolic network of Saccharomyces cerevisiae S288c. Yeast 6 comprises 1458 metabolites participating in 1888 reactions, which are annotated with 900 yeast genes encoding the catalyzing enzymes. Compared with Yeast 5, Yeast 6 demonstrates improved sensitivity, specificity and positive and negative predictive values for predicting gene essentiality in glucose-limited aerobic conditions when analyzed with flux balance analysis. Additionally, Yeast 6 improves the accuracy of predicting the likelihood that a mutation will cause auxotrophy. The network reconstruction is available as a Systems Biology Markup Language (SBML) file enriched with Minimum Information Requested in the Annotation of Biochemical Models (MIRIAM)-compliant annotations. Small- and macromolecules in the network are referenced to authoritative databases such as UniProt or ChEBI. Molecules and reactions are also annotated with appropriate publications that contain supporting evidence. Yeast 6 is freely available at http://yeast.sf.net/ as three separate SBML files: a model using the SBML level 3 Flux Balance Constraint package, a model compatible with the MATLAB® COBRA Toolbox for backward compatibility and a reconstruction containing only reactions for which there is experimental evidence (without the non-biological reactions necessary for simulating growth).
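The four evaluation measures named above follow directly from a confusion matrix of predicted versus observed gene essentiality. The sketch below assumes that "positive" means "essential", which is a labeling choice not stated in the abstract. The gene labels and the `essentiality_metrics` helper are hypothetical.

```python
def essentiality_metrics(predicted, observed):
    """Compare predicted and observed essentiality calls.

    Both arguments map gene -> True if the gene is called essential.
    Assumption: the positive class is "essential".
    """
    tp = sum(1 for g in observed if predicted[g] and observed[g])
    fp = sum(1 for g in observed if predicted[g] and not observed[g])
    fn = sum(1 for g in observed if not predicted[g] and observed[g])
    tn = sum(1 for g in observed if not predicted[g] and not observed[g])

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a class is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # essential genes correctly called
        "specificity": ratio(tn, tn + fp),  # non-essential genes correctly called
        "ppv": ratio(tp, tp + fp),          # predicted-essential calls that are right
        "npv": ratio(tn, tn + fn),          # predicted-non-essential calls that are right
    }

# Hypothetical toy example with eight genes labeled A..H.
gene_labels = "ABCDEFGH"
observed = {g: g in "ABC" for g in gene_labels}
predicted = {g: g in "ABD" for g in gene_labels}
metrics = essentiality_metrics(predicted, observed)
print(metrics)
```

With these toy calls there are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, so sensitivity and positive predictive value are both 2/3, and specificity and negative predictive value are both 0.8.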
We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein – Identification of Structured Signatures and Classifiers (ISSAC) – that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.
From a multi-study, integrated transcriptomic dataset, we identified a marker panel for differentiating major human brain cancers at the gene-expression level. The ISSAC molecular signatures for brain cancers, composed of 44 unique genes, are based on comparing expression levels of pairs of genes, and phenotype prediction follows a diagnostic hierarchy. We found that sufficient dataset integration across multiple studies greatly enhanced diagnostic performance on truly independent validation sets, whereas signatures learned from only one dataset typically led to high error rate. Molecular signatures of brain cancers, when obtained using all currently available gene-expression data, achieved 90% phenotype prediction accuracy. Thus, our integrative approach holds significant promise for developing organ-level, comprehensive, molecular signatures of disease.
The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much of the data exists in silos and is not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites, provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.
network biology; bioinformatics
Driven by advancements in high-throughput biological technologies and the growing number of sequenced genomes, the construction of in silico models at the genome scale has provided powerful tools to investigate a vast array of biological systems and applications. Here, we review comprehensively the uses of such models in industrial and medical biotechnology, including biofuel generation, food production, and drug development. While the use of in silico models is still in its early stages for delivering to industry, significant initial successes have been achieved. For the cases presented here, genome-scale models predict engineering strategies to enhance properties of interest in an organism or to inhibit harmful mechanisms of pathogens or in disease. Going forward, genome-scale in silico models promise to extend their application and analysis scope to become a transformative tool in biotechnology. As such, genome-scale models can provide a basis for rational genome-scale engineering and synthetic biology.
genome-scale models; constraint-based analysis; cellular networks; systems biology
Public databases such as the NCBI Gene Expression Omnibus contain extensive and exponentially increasing amounts of high-throughput data that can be applied to molecular phenotype characterization. Collectively, these data can be analyzed for such purposes as disease diagnosis or phenotype classification. One family of algorithms that has proven useful for disease classification is based on relative expression analysis and includes the Top-Scoring Pair (TSP), k-Top-Scoring Pairs (k-TSP), Top-Scoring Triplet (TST) and Differential Rank Conservation (DIRAC) algorithms. These relative expression analysis algorithms hold significant advantages for identifying interpretable molecular signatures for disease classification, and have been implemented previously on a variety of computational platforms with varying degrees of usability. To increase the user-base and maximize the utility of these methods, we developed the program AUREA (Adaptive Unified Relative Expression Analyzer)—a cross-platform tool that has a consistent application programming interface (API), an easy-to-use graphical user interface (GUI), fast running times and automated parameter discovery.
Herein, we describe AUREA, an efficient, cohesive, and user-friendly open-source software system that comprises a suite of methods for relative expression analysis. AUREA incorporates existing methods, while extending their capabilities and bringing uniformity to their interfaces. We demonstrate that combining these algorithms and adaptively tuning parameters on the training sets makes these algorithms more consistent in their performance and demonstrate the effectiveness of our adaptive parameter tuner by comparing accuracy across diverse datasets.
We have integrated several relative expression analysis algorithms and provided a unified interface for their implementation while making data acquisition, parameter fixing, data merging, and results analysis ‘point-and-click’ simple. The unified interface and the adaptive parameter tuning of AUREA provide an effective framework in which to investigate the massive amounts of publicly available data by both ‘in silico’ and ‘bench’ scientists. AUREA can be found at http://price.systemsbiology.net/AUREA/.
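The simplest member of the relative expression family, the Top-Scoring Pair, can be sketched in a few lines. The code below is an illustrative reimplementation under simple assumptions: samples are lists of expression values indexed by gene, ties are ignored, and the search is exhaustive. It does not use or reproduce the AUREA API, and the function names and data are hypothetical.

```python
from itertools import combinations

def tsp_score(class_a, class_b, i, j):
    """Signed score: P(gene i < gene j | class A) - P(gene i < gene j | class B)."""
    p_a = sum(s[i] < s[j] for s in class_a) / len(class_a)
    p_b = sum(s[i] < s[j] for s in class_b) / len(class_b)
    return p_a - p_b

def top_scoring_pair(class_a, class_b, n_genes):
    """Exhaustively search all gene pairs for the largest absolute score."""
    best = max(
        combinations(range(n_genes), 2),
        key=lambda pair: abs(tsp_score(class_a, class_b, *pair)),
    )
    return best, tsp_score(class_a, class_b, *best)

def classify(sample, pair, signed_score):
    """Assign class A if the sample shows the ordering that is more frequent in A."""
    i, j = pair
    ordering_seen = sample[i] < sample[j]
    ordering_typical_of_a = signed_score > 0
    return "A" if ordering_seen == ordering_typical_of_a else "B"

# Hypothetical toy expression data: rows are samples, columns are three genes.
class_a = [[1, 5, 3], [2, 6, 3], [3, 4, 2]]
class_b = [[5, 1, 3], [6, 2, 3], [4, 3, 2]]

pair, score = top_scoring_pair(class_a, class_b, n_genes=3)
label_1 = classify([2, 7, 1], pair, score)
label_2 = classify([9, 3, 4], pair, score)
print(pair, score, label_1, label_2)
```

Because the rule depends only on which of two genes is higher within a sample, it is unchanged by any monotone rescaling of that sample's values. The exhaustive pair search also grows quadratically with the number of genes.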
Human tissues perform diverse metabolic functions. Mapping out these tissue-specific functions in genome-scale models will advance our understanding of the metabolic basis of various physiological and pathological processes. The global knowledgebase of metabolic functions categorized for the human genome (Human Recon 1) coupled with abundant high-throughput data now makes possible the reconstruction of tissue-specific metabolic models. However, the number of available tissue-specific models remains incomplete compared with the large diversity of human tissues.
We developed a method called metabolic Context-specificity Assessed by Deterministic Reaction Evaluation (mCADRE). mCADRE is able to infer a tissue-specific network based on gene expression data and metabolic network topology, along with evaluation of functional capabilities during model building. mCADRE produces models with similar or better functionality and achieves dramatic computational speedup over existing methods. Using our method, we reconstructed draft genome-scale metabolic models for 126 human tissue and cell types. Among these, there are models for 26 tumor tissues along with their normal counterparts, and 30 different brain tissues. We performed pathway-level analyses of this large collection of tissue-specific models and identified the eicosanoid metabolic pathway, especially reactions catalyzing the production of leukotrienes from arachidonic acid, as potential drug targets that selectively affect tumor tissues.
This large collection of 126 genome-scale draft metabolic models provides a useful resource for studying the metabolic basis for a variety of human diseases across many tissues. The functionality of the resulting models and the fast computational speed of the mCADRE algorithm make it a useful tool to build and update tissue-specific metabolic models.
Automated metabolic network reconstruction; Brain; Cancer metabolism; Tissue-specific metabolic model; Constraint-based modeling
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify “motifs” that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA “background” sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
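To make concrete what scoring a motif against a background model means, the sketch below computes a log-likelihood ratio for each window of a sequence under a position weight matrix versus a zero-order (independent-base) background. This is a generic textbook construction, not the LR or ALR algorithm referenced above. The matrix, background frequencies and sequence are hypothetical.

```python
from math import log2

def log_likelihood_ratio(site, pwm, background):
    """Sum over positions of log2(P(base | motif column) / P(base | background))."""
    return sum(log2(column[base] / background[base]) for base, column in zip(site, pwm))

def best_site(sequence, pwm, background):
    """Slide the motif along the sequence and return (position, score) of the best window."""
    width = len(pwm)
    scored = [
        (log_likelihood_ratio(sequence[i:i + width], pwm, background), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, position = max(scored)
    return position, score

# Hypothetical 3-column motif that prefers the site "AGC".
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform zero-order background, the simplest possible null model.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

position, score = best_site("TTAGCTT", pwm, background)
print(position, round(score, 3))
```

A uniform, independent-base null such as this one is about the simplest background possible. Scores computed against it are the kind of significance assessment the abstract argues can be overly optimistic compared with real background sequences.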
Relative expression algorithms such as the top-scoring pair (TSP) and the
top-scoring triplet (TST) have several strengths that distinguish them from
other classification methods, including resistance to overfitting,
invariance to most data normalization methods, and biological
interpretability. The top-scoring ‘N’ (TSN) algorithm is a
generalized form of other relative expression algorithms which uses generic
permutations and a dynamic classifier size to control both the permutation
and combination space available for classification.
TSN was tested on nine cancer datasets, showing statistically significant
differences in classification accuracy between different classifier sizes
(choices of N). TSN also performed competitively against a wide
variety of different classification methods, including artificial neural
networks, classification trees, discriminant analysis, k-nearest neighbor,
naïve Bayes, and support vector machines, when tested on the Microarray
Quality Control II datasets. Furthermore, TSN exhibits low levels of
overfitting on training data compared to other methods, giving confidence
that results obtained during cross validation will be more generally
applicable to external validation sets.
TSN preserves the strengths of other relative expression algorithms while
allowing a much larger permutation and combination space to be explored,
potentially improving classification accuracies when fewer measured
features are available.
Classification; Top-scoring pair; Relative expression; Cross validation; Support vector machine; Graphics processing unit; Microarray
Proteomics has been applied to study intracellular bacteria and phagocytic vacuoles in different host cell lines, especially macrophages (Mφs). For mycobacterial phagosomes, few studies have identified more than several hundred proteins for systems assessment of the phagosome maturation and antigen presentation pathways. More importantly, there has been a scarcity of publications on the proteomic characterization of mycobacterial phagosomes in dendritic cells (DCs). In this work, we report a global proteomic analysis of Mφ and DC phagosomes infected with a virulent, an attenuated, and a vaccine strain of mycobacteria. We used label-free quantitative proteomics and bioinformatics tools to decipher the regulation of phagosome maturation and antigen presentation pathways in Mφs and DCs. We found that the phagosomal antigen presentation pathways are repressed more in DCs than in Mφs. The results suggest that virulent mycobacteria might co-opt the host immune system to stimulate granuloma formation for persistence while minimizing the antimicrobial immune response to enhance mycobacterial survival. The studies on phagosomal proteomes have also shown promise in discovering new antigen presentation mechanisms that a professional antigen-presenting cell might use to overcome the mycobacterial blockade of conventional antigen presentation pathways.
proteomics; phagosome; macrophage; dendritic cell; antigen presentation; systems biology; Mycobacterium tuberculosis
Summary: The top-scoring pair (TSP) and top-scoring triplet (TST) algorithms are powerful methods for classification from expression data, but analysis of all combinations across thousands of human transcriptome samples is computationally intensive, and has not yet been achieved for TST. Implementation of these algorithms for the graphics processing unit results in dramatic speedup of two orders of magnitude, greatly increasing the searchable combinations and accelerating the pace of discovery.
Supplementary information: Supplementary data are available at Bioinformatics online.
The phenotype of any organism on earth is, in large part, the consequence of interplay between numerous gene products encoded in the genome, and such interplay between gene products affects the evolutionary fate of the genome itself through the resulting phenotype. In this regard, contemporary genomes can be used as molecular records that reveal associations of various genes working in their natural lifestyles. By analyzing thousands of orthologs across ∼600 bacterial species, we constructed a map of gene-gene co-occurrence across much of the sequenced biome. Genes that preferentially co-occur in the same organisms are called herein correlogs; genes that preferentially avoid co-occurring are called anti-correlogs. To quantify correlogy and anti-correlogy, we alleviated the contribution of indirect correlations between genes by adapting ideas developed for reverse engineering of transcriptional regulatory networks. Resultant correlogous associations are highly enriched for physically interacting proteins and for co-expressed transcripts, clearly differentiating a subgroup of functionally-obligatory protein interactions from conditional or transient interactions. Other biochemical and phylogenetic properties were also found to be reflected in correlogous and anti-correlogous relationships. Additionally, our study elucidates the global organization of the gene association map, in which various modules of correlogous genes are strikingly interconnected by anti-correlogous crosstalk between the modules. We then demonstrate the effectiveness of such associations across different domains of life and environmental microbial communities. These phylogenetic profiling approaches infer functional coupling of genes regardless of mechanistic details, and may be useful to guide exogenous gene import in synthetic biology.
Genes in organisms have a number of interactions with one another in their biological contexts. For example, proteins produced from one gene may interact with other proteins produced from another gene to perform together a particular biological task, and such pairs of cooperative genes may often reside together in the same organisms. We analyzed thousands of genes across ∼600 bacterial species, and found genes with favored co-occurrence in the same organisms (termed correlogs) or disfavored co-occurrence (termed anti-correlogs). These co-occurrence patterns are significantly reflective of actual biochemical interplays between genes, and distinct cliques of correlogous genes are seamlessly interrelated through anti-correlogous links between the cliques. The ‘sociology’ of genes inferred by this approach provides useful information on how to engineer a cell, such as for production of a desired byproduct. For example, an important gene in cellobiose digestion for biofuel production, bglB, is suggested to function better in a cell factory when co-activated with another gene rhaM, the correlogous partner we found in our analysis.
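As a rough illustration of the co-occurrence idea, the sketch below scores a pair of genes by the Pearson correlation of their presence/absence profiles across species: positive values suggest correlog-like behavior and negative values anti-correlog-like behavior. This is only an assumed, simplified scoring; it does not include the correction for indirect correlations described above, and the gene names and profiles are hypothetical.

```python
from math import sqrt


def cooccurrence_score(x, y):
    """Pearson correlation between two binary presence/absence profiles.

    x, y: lists of 0/1 values, one entry per species (same species order).
    Returns 0.0 when either gene is present in all or in no species,
    because the correlation is undefined in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical presence/absence profiles across four species.
gene_a = [1, 1, 0, 0]
gene_b = [1, 1, 0, 0]  # always found together with gene_a
gene_c = [0, 0, 1, 1]  # never found together with gene_a

same = cooccurrence_score(gene_a, gene_b)      # positive: correlog-like
opposite = cooccurrence_score(gene_a, gene_c)  # negative: anti-correlog-like
```

A direct pairwise score like this cannot distinguish direct from indirect associations, which is why a correction step of the kind mentioned in the abstract is needed in practice.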
Biofuels derived from lignocellulosic biomass offer promising alternative renewable energy sources for transportation fuels. Significant effort has been made to engineer Saccharomyces cerevisiae to efficiently ferment pentose sugars such as D-xylose and L-arabinose into biofuels such as ethanol through heterologous expression of the fungal D-xylose and L-arabinose pathways. However, one of the major bottlenecks in these fungal pathways is that the cofactors are not balanced, which contributes to inefficient utilization of pentose sugars. We utilized a genome-scale model of S. cerevisiae to predict the maximal achievable growth rate for cofactor balanced and imbalanced D-xylose and L-arabinose utilization pathways. Dynamic flux balance analysis (DFBA) was used to simulate batch fermentation of glucose, D-xylose, and L-arabinose. The dynamic models and experimental results are in good agreement for the wild type and for the engineered D-xylose utilization pathway. In simulations, cofactor balancing the engineered D-xylose and L-arabinose utilization pathways increased ethanol batch production by 24.7% while reducing the predicted substrate utilization time by 70%. Furthermore, the effects of cofactor balancing the engineered pentose utilization pathways were evaluated throughout the genome-scale metabolic network. This work not only provides new insights into the global network effects of cofactor balancing but also provides useful guidelines for engineering a recombinant yeast strain with cofactor balanced engineered pathways that efficiently co-utilizes pentose and hexose sugars for biofuels production. Experimental switching of cofactor usage in enzymes has been demonstrated, but is a time-consuming effort. Therefore, systems biology models that can predict the likely outcome of such strain engineering efforts are highly useful for identifying which efforts are likely to be worth the significant time investment.
Cancer is a complex disease that involves multiple types of biological interactions across diverse physical, temporal, and biological scales. This complexity presents substantial challenges for the characterization of cancer biology, and motivates the study of cancer in the context of molecular, cellular, and physiological systems. Computational models of cancer are being developed to aid both biological discovery and clinical medicine. The development of these in silico models is facilitated by rapidly advancing experimental and analytical tools that generate information-rich, high-throughput biological data. Statistical models of cancer at the genomic, transcriptomic, and pathway levels have proven effective in developing diagnostic and prognostic molecular signatures, as well as in identifying perturbed pathways. Statistically-inferred network models can prove useful in settings where data overfitting can be avoided, and provide an important means for biological discovery. Mechanistically-based signaling and metabolic models that apply a priori knowledge of biochemical processes derived from experiments can also be reconstructed where data are available, and can provide insight and predictive ability regarding the dynamical behavior of these systems. At longer length scales, continuum and agent-based models of the tumor microenvironment and other tissue-level interactions enable modeling of cancer cell populations and tumor progression. Even though cancer has been among the most-studied human diseases using systems approaches, significant challenges remain before the enormous potential of in silico cancer biology can be fully realized.
Systems medicine; Cancer; Personalized medicine; Systems biology; Computational biology
The search for improved molecular cancer diagnostics is a challenge for which systems approaches show great promise. As is becoming increasingly clear, cancer is a perpetually-evolving, highly multi-factorial disease. With next generation sequencing providing an ever-increasing amount of high-throughput data, the need for analytical tools that can provide meaningful context is critical. Systems approaches have demonstrated an ability to separate meaningful signal from noise that arises from population heterogeneity, heterogeneity within and across tumors, and multiple sources of technical variation when sufficient sample sizes are obtained and standardized measurement technologies are used. The ability to develop clinically useful molecular cancer diagnostics will be predicated on advancements on two major fronts: 1) more comprehensive and accurate measurements of multiple endpoints, and 2) more sophisticated analytical tools that synthesize high-throughput data into meaningful reflections of cellular states. To this end, systems approaches that have integrated transcriptomic data with biomolecular networks have shown promise in their ability to classify tumor subtypes, predict clinical progression, and inform treatment options. Ultimately, the success of systems approaches will be measured by their ability to develop molecular cancer diagnostics through distilling complex, systems-wide information into simple, salient, actionable information.
The computational identification from global data sets of stable and predictive patterns of gene and protein relative expression reversals offers a simple, yet powerful approach to target therapies for personalized medicine and to identify pathways that are disease-perturbed. We previously utilized this approach to identify a molecular classifier with near 100% accuracy for differentiating gastrointestinal stromal tumor (GIST) and leiomyosarcoma (LMS), two cancers that have very similar histopathology, but require very different treatments. Differential Rank Conservation (DIRAC) is a novel approach for studying gene ordering within pathways and is based on the relative expression ranks of participating genes. DIRAC provides quantitative measures of how pathway rankings differ both within and between phenotypes. DIRAC between pathways in a selected phenotype contrasts the scenarios where either (i) pathways are ranked similarly in all samples; or (ii) the ordering of pathway genes is highly varied. We examined gene expression in GIST and LMS tumor profiles and identified pathways that appear to be tightly regulated based on high conservation of gene ordering. The second form of DIRAC manifests as a change in ranking (i.e., shuffling) between phenotypes for a selected pathway. These variably expressed pathways serve as signatures for molecular classification, and the ability to accurately classify microarray samples provided strong validation for the pathway-level expression differences identified by DIRAC.
The enormous amount of biomolecule measurement data generated from high-throughput technologies has brought an increased need for computational tools in biological analyses. Such tools can enhance our understanding of human health and genetic diseases, such as cancer, by accurately classifying phenotypes, detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression. In the case of gene expression microarray data, standard statistical learning methods have been used to identify classifiers that can accurately distinguish disease phenotypes. However, these mathematical prediction rules are often highly complex, and they lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. In this review, we survey a powerful collection of computational methods for analyzing transcriptomic microarray data that address these limitations. Relative Expression Analysis (RXA) is based only on the relative orderings among the expressions of a small number of genes. Specifically, we provide a description of the first and simplest example of RXA, the k-TSP classifier, which is based on k pairs of genes; the case k = 1 is the TSP classifier. Given their simplicity and ease of biological interpretation, as well as their invariance to data normalization and parameter-fitting, these classifiers have been widely applied in aiding molecular diagnostics in a broad range of human cancers. We review several studies which demonstrate accurate classification of disease phenotypes (e.g., cancer vs. normal), cancer subclasses (e.g., AML vs. ALL, GIST vs. LMS), disease outcomes (e.g., metastasis, survival), and diverse human pathologies assayed through blood-borne leukocytes. 
The studies presented demonstrate that RXA—specifically the TSP and k-TSP classifiers—is a promising new class of computational methods for analyzing high-throughput data, and has the potential to significantly contribute to molecular cancer diagnosis and prognosis.
relative expression; classification; microarray analysis; computational biology
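The pair-based decision rule described above can be illustrated with a minimal sketch of a two-class TSP classifier. It assumes the commonly described scoring, in which each gene pair (i, j) is scored by the absolute difference between the two classes in the frequency of the ordering "gene i below gene j", and the highest-scoring pair defines the rule. The function names, the tie-breaking (first pair found wins) and the toy data are illustrative assumptions, not taken from the reviewed studies.

```python
from itertools import combinations


def fit_tsp(samples, labels):
    """Find the top-scoring pair for a two-class problem.

    samples: list of expression profiles (equal-length lists of numbers)
    labels:  list of class labels with exactly two distinct values
    Returns (i, j, class_if_less, class_otherwise).
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = classes
    by_class = {c: [s for s, y in zip(samples, labels) if y == c] for c in classes}
    best = None
    for i, j in combinations(range(len(samples[0])), 2):
        # Frequency of the ordering "gene i < gene j" within each class
        freq = {c: sum(s[i] < s[j] for s in members) / len(members)
                for c, members in by_class.items()}
        score = abs(freq[a] - freq[b])
        if best is None or score > best[0]:
            best = (score, i, j, freq[a] > freq[b])
    _, i, j, a_when_less = best
    return (i, j, a, b) if a_when_less else (i, j, b, a)


def predict_tsp(model, sample):
    """Classify one profile by comparing the two genes of the chosen pair."""
    i, j, class_if_less, class_otherwise = model
    return class_if_less if sample[i] < sample[j] else class_otherwise


# Toy data: three genes, two samples per class (values are made up).
train = [[1, 5, 3], [2, 6, 1], [5, 1, 2], [7, 2, 9]]
train_labels = ["A", "A", "B", "B"]
model = fit_tsp(train, train_labels)
prediction = predict_tsp(model, [3, 4, 0])
```

Because the rule depends only on which of two values is larger within a sample, it is unchanged by any monotone normalization of that sample, which is the invariance property noted in the review. A k-TSP classifier would combine the votes of the k best disjoint pairs instead of a single pair.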
The development of a complete organism from a single cell involves extraordinarily complex orchestration of biological processes that vary intricately across space and time. Systems biology seeks to describe how all elements of a biological system interact in order to understand, model, and ultimately predict aspects of emergent biological processes. Embryogenesis represents an extraordinary opportunity – and challenge – for the application of systems biology. Systems approaches have already been used successfully to study various aspects of development, from complex intracellular networks to 4D models of organogenesis. Going forward, great advancements and discoveries can be expected from systems approaches applied to embryogenesis and developmental biology.
Development; Regulatory Networks; Computational Models; Organogenesis; Complex Adaptive Systems
A powerful way to separate signal from noise in biology is to convert the molecular data from individual genes or proteins into an analysis of comparative biological network behaviors. One of the limitations of previous network analyses is that they do not take into account the combinatorial nature of gene interactions within the network. We report here a new technique, Differential Rank Conservation (DIRAC), which permits one to assess these combinatorial interactions to quantify various biological pathways or networks in a comparative sense, and to determine how they change in different individuals experiencing the same disease process. This approach is based on the relative expression values of participating genes—i.e., the ordering of expression within network profiles. DIRAC provides quantitative measures of how network rankings differ either among networks for a selected phenotype or among phenotypes for a selected network. We examined disease phenotypes including cancer subtypes and neurological disorders and identified networks that are tightly regulated, as defined by high conservation of transcript ordering. Interestingly, we observed a strong trend to looser network regulation in more malignant phenotypes and later stages of disease. At a sample level, DIRAC can detect a change in ranking between phenotypes for any selected network. Variably expressed networks represent statistically robust differences between disease states and serve as signatures for accurate molecular classification, validating the information about expression patterns captured by DIRAC. Importantly, DIRAC can be applied not only to transcriptomic data, but to any ordinal data type.
The systems approach to medicine derives from the idea that diseased cells arise from one or more perturbed biological networks due to the net effect of interactions among multiple molecular agents; by measuring differences in the abundance of biomolecules (e.g., mRNA, proteins, metabolites) we can identify reporters of network states and uncover molecular signatures of disease. However, a major limitation of previously published network analyses is the focus on small numbers of individual, differentially-expressed genes, hence the failure to take into account combinatorial interactions. We report a new technique, Differential Rank Conservation, for identifying and measuring network-level perturbations. Our rank conservation index is based entirely on the relative levels of expression for participating genes and allows us to detect differences in network orderings between networks for a given phenotype and between phenotypes for a given network. In examining cancer subtypes and neurological disorders, we identified networks that are tightly and loosely regulated, as defined by the level of conservation of transcript ordering, and observed a strong trend to looser network regulation in more malignant phenotypes and later stages of disease. We also demonstrate that variably expressed networks represent robust differences between disease states.
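To make the idea of conserved transcript ordering concrete, the following sketch computes a rank conservation index for one network in one phenotype. It assumes a formulation in which each sample is reduced to its pairwise gene orderings, the phenotype's template is the majority ordering of each pair, and the index is the average fraction of pairs on which samples agree with the template. All function names and data values are hypothetical, and the details may differ from the published method.

```python
from itertools import combinations


def pairwise_orderings(profile):
    """1 where gene i is expressed below gene j, else 0, for every pair i < j."""
    return [int(profile[i] < profile[j])
            for i, j in combinations(range(len(profile)), 2)]


def rank_template(profiles):
    """Majority ordering of each gene pair across the samples of a phenotype."""
    vectors = [pairwise_orderings(p) for p in profiles]
    n = len(vectors)
    return [int(2 * sum(column) > n) for column in zip(*vectors)]


def matching_score(profile, template):
    """Fraction of gene pairs whose ordering in this sample matches the template."""
    orderings = pairwise_orderings(profile)
    return sum(o == t for o, t in zip(orderings, template)) / len(template)


def rank_conservation_index(profiles):
    """Average matching score of a phenotype's samples against its own template."""
    template = rank_template(profiles)
    return sum(matching_score(p, template) for p in profiles) / len(profiles)


# Hypothetical three-gene network measured in two phenotypes.
tight = [[1, 2, 3], [2, 4, 6], [0, 5, 9]]             # same ordering in every sample
loose = [[1, 2, 3], [3, 2, 1], [2, 3, 1], [1, 3, 2]]  # ordering varies by sample

tight_index = rank_conservation_index(tight)
loose_index = rank_conservation_index(loose)
```

Under these assumptions a tightly regulated network scores close to 1 and a loosely regulated one scores lower. A sample could then be classified by comparing its matching scores against the templates of the two phenotypes for a variably expressed network.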