Isolating pure microbial cultures and cultivating them in the laboratory on defined media is used to more fully characterize the metabolism and physiology of organisms. However, identifying an appropriate growth medium for a novel isolate remains a challenging task. Even organisms with sequenced and annotated genomes can be difficult to grow, despite our ability to build genome-scale metabolic networks that connect genomic data with metabolic function. The scientific literature is scattered with information about defined growth media used successfully for cultivating a wide variety of organisms, but to date there exists no centralized repository to inform efforts to cultivate less characterized organisms by bridging the gap between genomic data and compound composition for growth media. Here we present MediaDB, a manually curated database of defined media that have been used for cultivating organisms with sequenced genomes, with an emphasis on organisms with metabolic network models. The database is accessible online, can be queried by keyword searches or downloaded in its entirety, and can generate exportable individual media formulation files. The data assembled in MediaDB facilitate comparative studies of organism growth media, serve as a starting point for formulating novel growth media, and contribute to formulating media for in silico investigation of metabolic networks. MediaDB is freely available for public use at https://mediadb.systemsbiology.net.
In the past 15 years, new “omics” technologies have made it possible to obtain high-resolution molecular snapshots of organisms, tissues, and even individual cells at various disease states and experimental conditions. It is hoped that these developments will usher in a new era of personalized medicine in which an individual’s molecular measurements are used to diagnose disease, guide therapy, and perform other tasks more accurately and effectively than is possible using standard approaches. There now exists a vast literature of reported “molecular signatures”. However, despite some notable exceptions, many of these signatures have suffered from limited reproducibility in independent datasets, insufficient sensitivity or specificity to meet clinical needs, or other challenges. In this paper, we discuss the process of molecular signature discovery on the basis of omics data. In particular, we highlight potential pitfalls in the discovery process, as well as strategies that can be used to increase the odds of successful discovery. Despite the difficulties that have plagued the field of molecular signature discovery, we remain optimistic about the potential to harness the vast amounts of available omics data in order to substantially impact clinical practice.
Diagnostics; Disease classification; Systems biology; Translational bioinformatics
Dysfunction in energy metabolism—including in pathways localized to the mitochondria—has been implicated in the pathogenesis of a wide array of disorders, ranging from cancer to neurodegenerative diseases to type II diabetes. The inherent complexities of energy and mitochondrial metabolism present a significant obstacle in the effort to understand the role that these molecular processes play in the development of disease. To help unravel these complexities, systems biology methods have been applied to develop an array of computational metabolic models, ranging from mitochondria-specific processes to genome-scale cellular networks. These constraint-based (CB) models can efficiently simulate aspects of normal and aberrant metabolism in various genetic and environmental conditions. Development of these models leverages—and also provides a powerful means to integrate and interpret—information from a wide range of sources including genomics, proteomics, metabolomics, and enzyme kinetics. Here, we review a variety of mechanistic modeling studies that explore metabolic functions, deficiency disorders, and aberrant biochemical pathways in mitochondria and related regions in the cell.
mitochondria; energy metabolism; constraint-based models; flux balance analysis; Warburg effect; systems biology
Methanosarcina acetivorans strain C2A is a marine methanogenic archaeon notable for its substrate utilization, genetic tractability, and novel energy conservation mechanisms. To help probe the phenotypic implications of this organism's unique metabolism, we have constructed and manually curated a genome-scale metabolic model of M. acetivorans, iMB745, which accounts for 745 of the 4,540 predicted protein-coding genes (16%) in the M. acetivorans genome. The reconstruction effort has identified key knowledge gaps and differences in peripheral and central metabolism between methanogenic species. Using flux balance analysis, the model quantitatively predicts wild-type phenotypes and is 96% accurate in knockout lethality predictions compared to currently available experimental data. The model was used to probe the mechanisms and energetics of by-product formation and growth on carbon monoxide, as well as the nature of the reaction catalyzed by the soluble heterodisulfide reductase HdrABC in M. acetivorans. The genome-scale model provides quantitative and qualitative hypotheses that can be used to help iteratively guide additional experiments to further the state of knowledge about methanogenesis.
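For readers unfamiliar with flux balance analysis, the optimization problem it solves is usually written in the following standard form. The symbols are notation assumed here for illustration and do not appear in the abstract: S is the stoichiometric matrix, v the vector of reaction fluxes, c the objective weights (typically selecting a biomass reaction), and v^lb and v^ub the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

In this framing, a gene knockout is commonly simulated by fixing the bounds of the reactions that depend on that gene to zero and re-solving; the knockout is predicted to be lethal when the optimal biomass flux falls to (near) zero.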
Driven by advancements in high-throughput biological technologies and the growing number of sequenced genomes, the construction of in silico models at the genome scale has provided powerful tools to investigate a vast array of biological systems and applications. Here, we review comprehensively the uses of such models in industrial and medical biotechnology, including biofuel generation, food production, and drug development. While the use of in silico models is still in its early stages for delivering to industry, significant initial successes have been achieved. For the cases presented here, genome-scale models predict engineering strategies to enhance properties of interest in an organism or to inhibit harmful mechanisms of pathogens or in disease. Going forward, genome-scale in silico models promise to extend their application and analysis scope to become a transformative tool in biotechnology. As such, genome-scale models can provide a basis for rational genome-scale engineering and synthetic biology.
genome-scale models; constraint-based analysis; cellular networks; systems biology
The Hundred Person Wellness Project (HPWP) is a 10-month pilot study of 100 ‘well’ individuals where integrated data from whole-genome sequencing, gut microbiome, clinical laboratory tests and quantified self measures from each individual are used to provide actionable results for health coaching with the goal of optimizing wellness and minimizing disease. In a commentary in BMC Medicine, Diamandis argues that HPWP and similar projects will likely result in ‘unnecessary and potential harmful over-testing’. We argue that this new approach will ultimately lead to lower costs, better healthcare, innovation and economic growth. The central points of the HPWP are: 1) it is focused on optimizing wellness through longitudinal data collection, integration and mining of individual data clouds, enabling development of predictive models of wellness and disease that will reveal actionable possibilities; and 2) by extending this study to 100,000 well people, we will establish multiparameter, quantifiable wellness metrics and identify markers for wellness to early disease transitions for most common diseases, which will ultimately allow earlier disease intervention, eventually transitioning the individual early on from a disease back to a wellness trajectory.
Please see related commentary: http://dx.doi.org/10.1186/s12916-014-0239-6.
Wellness; Personalized medicine; Whole-genome sequencing; Health behavior change; Actionable; P4 Medicine; Systems medicine; Gut microbiome
The enormous amount of biomolecule measurement data generated from high-throughput technologies has brought an increased need for computational tools in biological analyses. Such tools can enhance our understanding of human health and genetic diseases, such as cancer, by accurately classifying phenotypes, detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression. In the case of gene expression microarray data, standard statistical learning methods have been used to identify classifiers that can accurately distinguish disease phenotypes. However, these mathematical prediction rules are often highly complex, and they lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. In this review, we survey a powerful collection of computational methods for analyzing transcriptomic microarray data that address these limitations. Relative Expression Analysis (RXA) is based only on the relative orderings among the expressions of a small number of genes. Specifically, we provide a description of the first and simplest example of RXA, the k-TSP classifier, which is based on k pairs of genes; the case k = 1 is the TSP classifier. Given their simplicity and ease of biological interpretation, as well as their invariance to data normalization and parameter-fitting, these classifiers have been widely applied in aiding molecular diagnostics in a broad range of human cancers. We review several studies which demonstrate accurate classification of disease phenotypes (e.g., cancer vs. normal), cancer subclasses (e.g., AML vs. ALL, GIST vs. LMS), disease outcomes (e.g., metastasis, survival), and diverse human pathologies assayed through blood-borne leukocytes. 
The studies presented demonstrate that RXA—specifically the TSP and k-TSP classifiers—is a promising new class of computational methods for analyzing high-throughput data, and has the potential to significantly contribute to molecular cancer diagnosis and prognosis.
relative expression; classification; microarray analysis; computational biology
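The TSP (k = 1) classifier described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `train_tsp` and `predict_tsp`, the tie-breaking rule (first pair wins), and the toy expression values are all hypothetical.

```python
from itertools import combinations

def train_tsp(samples, labels):
    """Pick the gene pair (i, j) whose ordering x[i] < x[j] differs most
    in frequency between class 0 and class 1 (labels must be 0 or 1)."""
    class0 = [s for s, y in zip(samples, labels) if y == 0]
    class1 = [s for s, y in zip(samples, labels) if y == 1]

    def freq_less(group, i, j):
        # Fraction of samples in the group where gene i is below gene j.
        return sum(1 for s in group if s[i] < s[j]) / len(group)

    best = None
    best_score = -1.0
    for i, j in combinations(range(len(samples[0])), 2):
        p0 = freq_less(class0, i, j)
        p1 = freq_less(class1, i, j)
        score = abs(p0 - p1)
        if score > best_score:  # ties keep the first pair found (assumption)
            best_score = score
            # Store which class is voted for when x[i] < x[j].
            best = (i, j, 0 if p0 > p1 else 1)
    return best

def predict_tsp(model, sample):
    """Classify one sample using only the relative order of two genes."""
    i, j, label_if_less = model
    return label_if_less if sample[i] < sample[j] else 1 - label_if_less

# Hypothetical toy data: 3 genes, 3 samples per class.
samples = [
    [1.0, 5.0, 3.0], [2.0, 6.0, 2.5], [0.5, 4.0, 9.0],   # class 0
    [3.0, 2.0, 4.0], [9.0, 1.0, 2.0], [8.0, 2.0, 1.0],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]
model = train_tsp(samples, labels)           # (0, 1, 0)
print(predict_tsp(model, [100.0, 200.0, 0.0]))  # 0
print(predict_tsp(model, [10.0, 2.0, 5.0]))     # 1
```

Because the decision depends only on whether one value is smaller than the other, any monotone transformation applied to a whole sample (for example a log transform or rescaling) leaves the prediction unchanged, which is the normalization invariance the abstract refers to.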
Mycobacterium tuberculosis senses and responds to the shifting and hostile landscape of the host. To characterize the underlying intertwined gene regulatory network governed by approximately 200 transcription factors of M. tuberculosis, we have assayed the global transcriptional consequences of overexpressing each transcription factor from an inducible promoter.
We cloned and overexpressed 206 transcription factors in M. tuberculosis to identify the regulatory signature of each. We identified 9,335 regulatory consequences of overexpressing each of 183 transcription factors, providing evidence of regulation for 70% of the M. tuberculosis genome. These transcriptional signatures agree well with previously described M. tuberculosis regulons. The number of genes differentially regulated by transcription factor overexpression varied from hundreds of genes to none, with the majority of expression changes repressing basal transcription. Exploring the global transcriptional maps of transcription factor overexpressing (TFOE) strains, we predicted and validated the phenotype of a regulator that reduces susceptibility to a first-line anti-tubercular drug, isoniazid. We also combined the TFOE data with an existing model of M. tuberculosis metabolism to predict the growth rates of individual TFOE strains with high fidelity.
This work has led to a systems-level framework describing the transcriptome of a devastating bacterial pathogen, characterized the transcriptional influence of nearly all individual transcription factors in M. tuberculosis, and demonstrated the utility of this resource. These results will stimulate additional systems-level and hypothesis-driven efforts to understand M. tuberculosis adaptations that promote disease.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0502-3) contains supplementary material, which is available to authorized users.
Ten years ago, the proposition that healthcare is evolving from reactive disease care to care that is predictive, preventive, personalized and participatory was regarded as highly speculative. Today, the core elements of that vision are widely accepted and have been articulated in a series of recent reports by the US Institute of Medicine. Systems approaches to biology and medicine are now beginning to provide patients, consumers and physicians with personalized information about each individual’s unique experience of both health and disease at the molecular, cellular and organ levels. This information will make disease care radically more cost effective by personalizing care to each person’s unique biology and by treating the causes rather than the symptoms of disease. It will also provide the basis for concrete action by consumers to improve their health as they observe the impact of lifestyle decisions. Working together in digitally powered familial and affinity networks, consumers will be able to reduce the incidence of the complex chronic diseases that currently account for 75% of disease-care costs in the USA.
big data; knowledge network; learning healthcare; new taxonomy of disease; omics studies; P4 medicine; personal data clouds; systems biology; systems medicine; wellness industry
The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in “batch-effects”) and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.
Here we quantify the impact of these combined “study-effects” on a disease signature’s predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of the number of studies quantifies the influence of study-effects on performance.
As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification.
We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when “sufficient” diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
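The difference between the two validation schemes compared above lies entirely in how samples are assigned to test sets. The sketch below generates both kinds of split for a list of per-sample study labels; the function names, the fold construction, and the toy study labels are assumptions made for illustration, not the authors' code.

```python
import random

def randomized_cv_splits(study_ids, n_folds, seed=0):
    """RCV: shuffle all samples and cut them into folds, ignoring study of origin."""
    idx = list(range(len(study_ids)))
    rng = random.Random(seed)
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    # Each split is (train indices, test indices).
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

def inter_study_splits(study_ids):
    """ISV: hold out every sample from one study at a time."""
    splits = []
    for held_out in sorted(set(study_ids)):
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        splits.append((train, test))
    return splits

# Hypothetical example: 8 samples drawn from 3 studies.
study_ids = ["A", "A", "A", "B", "B", "C", "C", "C"]
rcv = randomized_cv_splits(study_ids, n_folds=4)
isv = inter_study_splits(study_ids)

for train, test in isv:
    # In ISV the classifier never sees the test study during training.
    print("held out:", {study_ids[i] for i in test}, "test indices:", test)
```

Training and scoring the same classifier on both sets of splits, then repeating as studies are added, would give the RCV-ISV gap as a function of the number of studies.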
Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.
Genome-scale metabolic modeling is a powerful approach that allows one to computationally simulate a variety of metabolic phenotypes. However, manually constructing accurate metabolic networks is extremely time intensive and it is thus desirable to have automated computational methods for providing high-quality metabolic networks. Incomplete knowledge of biological chemistries leads to missing, ambiguous, or inaccurate gene annotations, and thus gives rise to incomplete metabolic networks. Computational algorithms for filling these gaps in a metabolic model rely on network topology based approaches that can result in solutions that are inconsistent with existing genomic data. We developed an algorithm that directly incorporates genomic evidence into the decision-making process for gap filling reactions. This algorithm both maximizes the consistency of gap filled reactions with available genomic data and identifies candidate genes for gap filled reactions. The algorithm has been integrated into KBase's metabolic modeling service, an automated metabolic network reconstruction framework that includes the ModelSEED automated metabolic reconstruction tools.
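The contrast between parsimony-based and likelihood-based gap filling can be illustrated with a deliberately simplified toy. The real method solves an optimization problem over the whole network; here we assume, purely for illustration, that two candidate reaction sets are already known to restore growth, and that the likelihood-based cost of a reaction is one minus its likelihood. The reaction names, likelihood values, and cost function are all hypothetical.

```python
def parsimony_cost(solution):
    """Parsimony: every added reaction costs the same, so fewer is better."""
    return len(solution)

def likelihood_cost(solution, likelihoods):
    """Assumed cost: reactions with strong genomic evidence are cheap to add.
    Reactions with no likelihood estimate are treated as likelihood 0."""
    return sum(1.0 - likelihoods.get(rxn, 0.0) for rxn in solution)

# Two hypothetical ways to fill the same gap.
candidates = [
    ("rxnA",),            # a single reaction with no homology support
    ("rxnB", "rxnC"),     # two reactions, both with homology support
]
likelihoods = {"rxnA": 0.0, "rxnB": 0.9, "rxnC": 0.8}

by_parsimony = min(candidates, key=parsimony_cost)
by_likelihood = min(candidates, key=lambda s: likelihood_cost(s, likelihoods))

print(by_parsimony)   # ('rxnA',)
print(by_likelihood)  # ('rxnB', 'rxnC')
```

The two criteria disagree here by construction: parsimony prefers the shorter solution, while the likelihood-weighted cost prefers the solution that is better supported by sequence homology, which is the sense of "genomically consistent" used above.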
Each year millions of pulmonary nodules are discovered by computed tomography and subsequently biopsied. As the majority of these nodules are benign, many patients undergo unnecessary and costly invasive procedures. We present a 13-protein blood-based classifier that differentiates malignant and benign nodules with high confidence, thereby providing a diagnostic tool to avoid invasive biopsy on benign nodules. Using a systems biology strategy, 371 protein candidates were identified and a multiple reaction monitoring (MRM) assay was developed for each. The MRM assays were applied in a three-site discovery study (n = 143) on plasma samples from patients with benign and Stage IA cancer matched on nodule size, age, gender and clinical site, producing a 13-protein classifier. The classifier was validated on an independent set of plasma samples (n = 104), exhibiting a high negative predictive value (NPV) of 90%. Validation performance on samples from a non-discovery clinical site showed NPV of 94%, indicating the general effectiveness of the classifier. A pathway analysis demonstrated that the classifier proteins are likely modulated by a few transcription regulators (NF2L2, AHR, MYC, FOS) that are associated with lung cancer, lung inflammation and oxidative stress networks. The classifier score was independent of patient nodule size, smoking history and age, which are risk factors used for clinical management of pulmonary nodules. Thus this molecular test can provide a powerful complementary tool for physicians in lung cancer diagnosis.
A grand challenge impeding optimal treatment outcomes for cancer patients arises from the complex nature of the disease: its cellular heterogeneity and the myriad dysfunctional molecular and genetic networks that result from genetic (somatic) and environmental perturbations. Systems biology, with its holistic approach to understanding fundamental principles in biology, and the empowering technologies in genomics, proteomics, single-cell analysis, microfluidics, and computational strategies, enables a comprehensive approach to medicine, which strives to unveil the pathogenic mechanisms of diseases, identify disease biomarkers, and begin to devise new strategies for drug target discovery. The integration of multi-dimensional high throughput “omics” measurements from tumor tissues and corresponding blood specimens, together with new systems strategies for diagnostics, enables the identification of cancer biomarkers that will enable presymptomatic diagnosis, stratification of disease, assessment of disease progression, evaluation of patient response to therapy, and the identification of recurrences. While some aspects of systems medicine are being adopted in clinical oncology practice through companion molecular diagnostics for personalized therapy, the mounting influx of global quantitative data from both wellness and disease is shaping a transformational paradigm in medicine that we have termed predictive, preventive, personalized, and participatory (P4) medicine, which requires new strategies, both scientific and organizational, to bring this revolution in medicine to patients and to the healthcare system. P4 medicine will have a profound impact on society, transforming the healthcare system, turning around the ever-escalating costs of healthcare, digitizing the practice of medicine, and creating enormous economic opportunities for those organizations and nations that embrace this revolution.
Systems medicine; cancer complexity; quantized cell populations; blood biomarkers; molecular diagnostics; P4 medicine
Clostridium beijerinckii is a well-known solvent-producing microorganism with great potential for biofuel and biochemical production. To better understand and improve the biochemical pathway to solvents, the development of genetic tools for engineering C. beijerinckii is highly desired. Based on mobile group II intron technology, a targetron gene knockout system was developed for C. beijerinckii in this study. This system was successfully employed to disrupt acid production pathways in C. beijerinckii, leading to pta (encoding phosphotransacetylase)- and buk (encoding butyrate kinase)-negative mutants. In addition to experimental characterization, the mutant phenotypes were analyzed in the context of our C. beijerinckii genome-scale model. Compared to those of the parental strain (C. beijerinckii 8052), acetate production in the pta mutant was substantially reduced and butyrate production was remarkably increased, while solvent production was dependent on the growth medium. The pta mutant also produced much higher levels of lactate, suggesting that disrupting pta influenced the energy generation and electron flow pathways. In contrast, acetate and butyrate production in the buk mutant was generally similar to that of the wild type, but solvent production was consistently 20 to 30% higher and glucose consumption was more rapid and complete. Our results suggest that the acid and solvent production of C. beijerinckii can be effectively altered by disrupting the acid production pathways. As the gene disruption method developed in this study does not leave any antibiotic marker in a disrupted allele, multiple and high-throughput gene disruption is feasible for elucidating genotype and phenotype relationships in C. beijerinckii.
Comparative genomics is a powerful approach for studying variation in physiological traits as well as the evolution and ecology of microorganisms. Recent technological advances have enabled sequencing large numbers of related genomes in a single project, requiring computational tools for their integrated analysis. In particular, accurate annotations and identification of gene presence and absence are critical for understanding and modeling the cellular physiology of newly sequenced genomes. Although many tools are available to compare the gene contents of related genomes, new tools are necessary to enable close examination and curation of protein families from large numbers of closely related organisms, to integrate curation with the analysis of gain and loss, and to generate metabolic networks linking the annotations to observed phenotypes.
We have developed ITEP, an Integrated Toolkit for Exploration of microbial Pan-genomes, to curate protein families, compute similarities to externally-defined domains, analyze gene gain and loss, and generate draft metabolic networks from one or more curated reference network reconstructions in groups of related microbial species among which the combination of core and variable genes constitutes their "pan-genomes". The ITEP toolkit consists of: (1) a series of modular command-line scripts for identification, comparison, curation, and analysis of protein families and their distribution across many genomes; (2) a set of Python libraries for programmatic access to the same data; and (3) pre-packaged scripts to perform common analysis workflows on a collection of genomes. ITEP’s capabilities include de novo protein family prediction, ortholog detection, analysis of functional domains, identification of core and variable genes and gene regions, sequence alignments and tree generation, annotation curation, and the integration of cross-genome analysis and metabolic networks for study of metabolic network evolution.
ITEP is a powerful, flexible toolkit for generation and curation of protein families. ITEP's modular design allows for straightforward extension as analysis methods and tools evolve. By integrating comparative genomics with the development of draft metabolic networks, ITEP harnesses the power of comparative genomics to build confidence in links between genotype and phenotype and helps disambiguate gene annotations when they are evaluated in both evolutionary and metabolic network contexts.
Comparative genomics; Clustering; Curation; Database; Metabolic networks; Orthologs; Pan-genome; Phylogenetics
Multiple models of human metabolism have been reconstructed, but each represents only a subset of our knowledge. Here we describe Recon 2, a community-driven, consensus ‘metabolic reconstruction’, which is the most comprehensive representation of human metabolism that is applicable to computational modeling. Compared with its predecessors, the reconstruction has improved topological and functional features, including ~2× more reactions and ~1.7× more unique metabolites. Using Recon 2 we predicted changes in metabolite biomarkers for 49 inborn errors of metabolism with 77% accuracy when compared to experimental data. Mapping metabolomic data and drug information onto Recon 2 demonstrates its potential for integrating and analyzing diverse data types. Using protein expression data, we automatically generated a compendium of 65 cell type–specific models, providing a basis for manual curation or investigation of cell-specific metabolic properties. Recon 2 will facilitate many future biomedical studies and is freely available at http://humanmetabolism.org/.
There is a strong need for computational frameworks that integrate different biological processes and data-types to unravel cellular regulation. Current efforts to reconstruct transcriptional regulatory networks (TRNs) focus primarily on proximal data such as gene co-expression and transcription factor (TF) binding. While such approaches enable rapid reconstruction of TRNs, the overwhelming combinatorics of possible networks limits identification of mechanistic regulatory interactions. Utilizing growth phenotypes and systems-level constraints to inform regulatory network reconstruction is an unmet challenge. We present our approach Gene Expression and Metabolism Integrated for Network Inference (GEMINI) that links a compendium of candidate regulatory interactions with the metabolic network to predict their systems-level effect on growth phenotypes. We then compare predictions with experimental phenotype data to select phenotype-consistent regulatory interactions. GEMINI makes use of the observation that only a small fraction of regulatory network states are compatible with a viable metabolic network, and outputs a regulatory network that is simultaneously consistent with the input genome-scale metabolic network model, gene expression data, and TF knockout phenotypes. GEMINI preferentially recalls gold-standard interactions (p-value = 10^-172), significantly better than using gene expression alone. We applied GEMINI to create an integrated metabolic-regulatory network model for Saccharomyces cerevisiae involving 25,000 regulatory interactions controlling 1,597 metabolic reactions. The model quantitatively predicts TF knockout phenotypes in new conditions (p-value = 10^-14) and revealed potential condition-specific regulatory mechanisms.
Our results suggest that a metabolic constraint-based approach can be successfully used to help reconstruct TRNs from high-throughput data, and highlight the potential of using a biochemically-detailed mechanistic framework to integrate and reconcile inconsistencies across different data-types. The algorithm and associated data are available at https://sourceforge.net/projects/gemini-data/.
Cellular networks, such as metabolic and transcriptional regulatory networks (TRNs), do not operate independently but work together in unison to determine cellular phenotypes. Further, the phenotype and architecture of one network constrains the topology of other networks. Hence, it is critical to study network components and interactions in the context of the entire cell. Typically, efforts to reconstruct TRNs focus only on immediately proximal data such as gene co-expression and transcription factor (TF)-binding. Herein, we take a different strategy by linking candidate TRNs with the metabolic network to predict systems-level responses such as growth phenotypes of TF knockout strains, and compare predictions with experimental phenotype data to select amongst the candidate TRNs. Our approach goes beyond traditional data integration approaches for network inference and refinement by using a predictive network model (metabolism) to refine another network model (regulation) – thus providing an alternative avenue to this area of research. Understanding how the networks function together in a cell will pave the way for synthetic biology and has a wide-range of applications in biotechnology, drug discovery and diagnostics. Further we demonstrate how metabolic models can integrate and reconcile inconsistencies across different data-types.
Astrocytoma is the most common glioma, accounting for half of all primary brain and spinal cord tumors. Late detection and the aggressive nature of high-grade astrocytomas contribute to high mortality rates. Though many studies identify candidate biomarkers using high-throughput transcriptomic profiling to stratify grades and subtypes, few have resulted in clinically actionable results. This shortcoming can be attributed, in part, to pronounced lab effects that reduce signature robustness and varied individual gene expression among patients with the same tumor. We addressed these issues by uniformly preprocessing publicly available transcriptomic data, comprising 306 tumor samples from three astrocytoma grades (Grades 2, 3, and 4) and 30 non-tumor samples (normal brain as control tissues). Utilizing Differential Rank Conservation (DIRAC), a network-based classification approach, we examined the global and individual patterns of network regulation across tumor grades. Additionally, we applied gene-based approaches to identify genes whose expression changed consistently with increasing tumor grade and evaluated their robustness across multiple studies using statistical sampling. Applying DIRAC, we observed a global trend of greater network dysregulation with increasing tumor aggressiveness. Individual networks displaying greater differences in regulation between adjacent grades play well-known roles in calcium/PKC, EGF, and transcription signaling. Interestingly, many of the 90 individual genes found to monotonically increase or decrease with astrocytoma grade are implicated in cancer-affected processes such as calcium signaling, mitochondrial metabolism, and apoptosis. The fact that specific genes monotonically increase or decrease with increasing astrocytoma grade may reflect shared oncogenic mechanisms among phenotypically similar tumors.
This work presents statistically significant results that enable better characterization of different human astrocytoma grades and may contribute to improvements in diagnosis and therapy choices. Our results also identify a number of testable hypotheses relating to astrocytoma etiology that may prove helpful in developing much-needed biomarkers for earlier disease detection.
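The rank-conservation idea behind DIRAC can be illustrated with a minimal sketch. It assumes a simplified reading of the method: within one network (gene set), each phenotype gets a majority-vote template of pairwise gene orderings, and the conservation index is the average fraction of pairs in each sample that agree with that template. The gene names, expression values and function names below are hypothetical and are not taken from the study.

```python
from itertools import combinations

def pair_orderings(sample, genes):
    """For every gene pair (a, b) in the network, record whether expr[a] < expr[b]."""
    return tuple(sample[a] < sample[b] for a, b in combinations(genes, 2))

def rank_template(samples, genes):
    """Majority-vote ordering of each gene pair across the samples of one phenotype."""
    orderings = [pair_orderings(s, genes) for s in samples]
    n = len(orderings)
    return tuple(sum(column) > n / 2 for column in zip(*orderings))

def rank_matching_score(sample, genes, template):
    """Fraction of gene pairs in one sample whose ordering agrees with the template."""
    observed = pair_orderings(sample, genes)
    return sum(o == t for o, t in zip(observed, template)) / len(template)

def rank_conservation_index(samples, genes):
    """Average matching score of a phenotype's samples against its own template."""
    template = rank_template(samples, genes)
    scores = [rank_matching_score(s, genes, template) for s in samples]
    return sum(scores) / len(scores)

# Hypothetical three-gene network measured in two tumor grades (toy values).
network = ["A", "B", "C"]
low_grade = [
    {"A": 1.0, "B": 2.0, "C": 3.0},
    {"A": 2.0, "B": 5.0, "C": 9.0},
    {"A": 0.5, "B": 1.5, "C": 4.0},
]
high_grade = [
    {"A": 1.0, "B": 2.0, "C": 3.0},
    {"A": 3.0, "B": 1.0, "C": 2.0},
    {"A": 2.0, "B": 3.0, "C": 1.0},
]

low_index = rank_conservation_index(low_grade, network)
high_index = rank_conservation_index(high_grade, network)
print(low_index, high_index)
```

In this toy case the low-grade samples share one ordering and the high-grade samples do not, so the high-grade index is lower. Under the simplified reading used here, a lower index corresponds to looser, more variable regulation of the network.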
Updates to maintain a state-of-the-art reconstruction of the yeast metabolic network are essential to reflect our understanding of yeast metabolism and functional organization, to eliminate any inaccuracies identified in earlier iterations, to improve predictive accuracy and to continue to expand into novel subsystems to extend the comprehensiveness of the model. Here, we present version 6 of the consensus yeast metabolic network (Yeast 6) as an update to the community effort to computationally reconstruct the genome-scale metabolic network of Saccharomyces cerevisiae S288c. Yeast 6 comprises 1458 metabolites participating in 1888 reactions, which are annotated with 900 yeast genes encoding the catalyzing enzymes. Compared with Yeast 5, Yeast 6 demonstrates improved sensitivity, specificity and positive and negative predictive values for predicting gene essentiality in glucose-limited aerobic conditions when analyzed with flux balance analysis. Additionally, Yeast 6 improves the accuracy of predicting the likelihood that a mutation will cause auxotrophy. The network reconstruction is available as a Systems Biology Markup Language (SBML) file enriched with Minimum Information Requested in the Annotation of Biochemical Models (MIRIAM)-compliant annotations. Small- and macromolecules in the network are referenced to authoritative databases such as UniProt or ChEBI. Molecules and reactions are also annotated with appropriate publications that contain supporting evidence. Yeast 6 is freely available at http://yeast.sf.net/ as three separate SBML files: a model using the SBML level 3 Flux Balance Constraint package, a model compatible with the MATLAB® COBRA Toolbox for backward compatibility and a reconstruction containing only reactions for which there is experimental evidence (without the non-biological reactions necessary for simulating growth).
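The four measures used to compare essentiality predictions can be made concrete with a short sketch. It assumes that "essential" is treated as the positive class and that predictions and observations are available as gene-to-boolean mappings. The gene names, calls and function name are hypothetical and do not come from Yeast 5 or Yeast 6.

```python
def essentiality_metrics(predicted, observed):
    """Compare predicted and observed essentiality calls (True = essential)."""
    tp = fp = tn = fn = 0
    for gene, is_essential in observed.items():
        called_essential = predicted[gene]
        if called_essential and is_essential:
            tp += 1          # essential gene correctly predicted essential
        elif called_essential and not is_essential:
            fp += 1          # viable knockout wrongly predicted lethal
        elif not called_essential and is_essential:
            fn += 1          # lethal knockout wrongly predicted viable
        else:
            tn += 1          # viable knockout correctly predicted viable

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Hypothetical calls for five genes (toy data, not from the model).
observed = {"geneA": True, "geneB": True, "geneC": False, "geneD": False, "geneE": False}
predicted = {"geneA": True, "geneB": False, "geneC": False, "geneD": False, "geneE": True}

metrics = essentiality_metrics(predicted, observed)
print(metrics)
```

Reporting all four values matters because essential genes are usually a minority class, so a model can score well on specificity while still missing many essential genes.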
We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. The resulting signatures were based on a new method reported herein, Identification of Structured Signatures and Classifiers (ISSAC), which yielded a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses of large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies: as is typical, we observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences. For this reason, we restricted ourselves to cases where at least two independent studies had been performed for each phenotype, and we reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, they achieved 90% phenotype prediction accuracy, defined as the accuracy of identifying a particular brain cancer from the background of all phenotypes. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.
From a multi-study, integrated transcriptomic dataset, we identified a marker panel for differentiating major human brain cancers at the gene-expression level. The ISSAC molecular signatures for brain cancers, composed of 44 unique genes, are based on comparing expression levels of pairs of genes, and phenotype prediction follows a diagnostic hierarchy. We found that sufficient dataset integration across multiple studies greatly enhanced diagnostic performance on truly independent validation sets, whereas signatures learned from only one dataset typically led to high error rates. Molecular signatures of brain cancers, when obtained using all currently available gene-expression data, achieved 90% phenotype prediction accuracy. Thus, our integrative approach holds significant promise for developing organ-level, comprehensive, molecular signatures of disease.
The explosion of biomedical data, both molecular (genomic and proteomic) and clinical, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much of the data exists in silos and is not analyzed in frameworks where all data are brought to bear on the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites, provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.
network biology; bioinformatics
Public databases such as the NCBI Gene Expression Omnibus contain extensive and exponentially increasing amounts of high-throughput data that can be applied to molecular phenotype characterization. Collectively, these data can be analyzed for such purposes as disease diagnosis or phenotype classification. One family of algorithms that has proven useful for disease classification is based on relative expression analysis and includes the Top-Scoring Pair (TSP), k-Top-Scoring Pairs (k-TSP), Top-Scoring Triplet (TST) and Differential Rank Conservation (DIRAC) algorithms. These relative expression analysis algorithms hold significant advantages for identifying interpretable molecular signatures for disease classification, and have been implemented previously on a variety of computational platforms with varying degrees of usability. To increase the user base and maximize the utility of these methods, we developed the program AUREA (Adaptive Unified Relative Expression Analyzer), a cross-platform tool that has a consistent application programming interface (API), an easy-to-use graphical user interface (GUI), fast running times and automated parameter discovery.
Herein, we describe AUREA, an efficient, cohesive, and user-friendly open-source software system that comprises a suite of methods for relative expression analysis. AUREA incorporates existing methods, while extending their capabilities and bringing uniformity to their interfaces. We demonstrate that combining these algorithms and adaptively tuning parameters on the training sets makes their performance more consistent, and we show the effectiveness of our adaptive parameter tuner by comparing accuracy across diverse datasets.
We have integrated several relative expression analysis algorithms and provided a unified interface for their implementation while making data acquisition, parameter fixing, data merging, and results analysis ‘point-and-click’ simple. The unified interface and the adaptive parameter tuning of AUREA provide an effective framework in which both ‘in silico’ and ‘bench’ scientists can investigate the massive amounts of publicly available data. AUREA can be found at http://price.systemsbiology.net/AUREA/.
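The simplest member of the relative expression family, the Top-Scoring Pair, can be sketched in a few lines. The sketch assumes the commonly described form of the rule: choose the gene pair whose ordering frequency differs most between two classes, then classify a new sample by that single comparison. It is not AUREA's API; the function names, gene names and values are hypothetical, and tie-breaking between equally scoring pairs is left out.

```python
from itertools import combinations

def fit_tsp(samples, labels, genes):
    """Pick the gene pair whose ordering frequency differs most between two classes."""
    class_0, class_1 = sorted(set(labels))  # assumes exactly two classes

    def ordering_frequency(a, b, cls):
        # Fraction of samples in class cls with expr[a] < expr[b].
        group = [s for s, y in zip(samples, labels) if y == cls]
        return sum(s[a] < s[b] for s in group) / len(group)

    best = None
    for a, b in combinations(genes, 2):
        p0 = ordering_frequency(a, b, class_0)
        p1 = ordering_frequency(a, b, class_1)
        score = abs(p0 - p1)
        if best is None or score > best["score"]:
            best = {
                "pair": (a, b),
                "score": score,
                # Class to predict when expr[a] < expr[b], and when it is not.
                "if_less": class_0 if p0 > p1 else class_1,
                "otherwise": class_1 if p0 > p1 else class_0,
            }
    return best

def predict_tsp(model, sample):
    """Classify one sample using only the ordering of the selected pair."""
    a, b = model["pair"]
    return model["if_less"] if sample[a] < sample[b] else model["otherwise"]

# Hypothetical training data: g1 < g2 in tumors, g1 > g2 in normal tissue.
train_samples = [
    {"g1": 1, "g2": 5, "g3": 3},
    {"g1": 2, "g2": 6, "g3": 1},
    {"g1": 1, "g2": 4, "g3": 9},
    {"g1": 7, "g2": 3, "g3": 4},
    {"g1": 8, "g2": 2, "g3": 9},
    {"g1": 6, "g2": 1, "g3": 0},
]
train_labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

tsp_model = fit_tsp(train_samples, train_labels, ["g1", "g2", "g3"])
# A sample measured on a different scale is still classified by ordering alone.
rescaled_call = predict_tsp(tsp_model, {"g1": 10, "g2": 50, "g3": 0})
print(tsp_model["pair"], rescaled_call)
```

Because the decision depends only on which of two genes is more highly expressed within a sample, any normalization that preserves within-sample ordering leaves the prediction unchanged, which is one reason these signatures are easy to interpret.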
Human tissues perform diverse metabolic functions. Mapping out these tissue-specific functions in genome-scale models will advance our understanding of the metabolic basis of various physiological and pathological processes. The global knowledgebase of metabolic functions categorized for the human genome (Human Recon 1) coupled with abundant high-throughput data now makes possible the reconstruction of tissue-specific metabolic models. However, the set of available tissue-specific models remains incomplete compared with the large diversity of human tissues.
We developed a method called metabolic Context-specificity Assessed by Deterministic Reaction Evaluation (mCADRE). mCADRE is able to infer a tissue-specific network based on gene expression data and metabolic network topology, along with evaluation of functional capabilities during model building. mCADRE produces models with similar or better functionality and achieves dramatic computational speed-up over existing methods. Using our method, we reconstructed draft genome-scale metabolic models for 126 human tissue and cell types. Among these, there are models for 26 tumor tissues along with their normal counterparts, and 30 different brain tissues. We performed pathway-level analyses of this large collection of tissue-specific models and identified the eicosanoid metabolic pathway, especially reactions catalyzing the production of leukotrienes from arachidonic acid, as potential drug targets that selectively affect tumor tissues.
This large collection of 126 genome-scale draft metabolic models provides a useful resource for studying the metabolic basis for a variety of human diseases across many tissues. The functionality of the resulting models and the fast computational speed of the mCADRE algorithm make it a useful tool to build and update tissue-specific metabolic models.
Automated metabolic network reconstruction; Brain; Cancer metabolism; Tissue-specific metabolic model; Constraint-based modeling
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify “motifs” that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA “background” sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is fundamentally due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
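How the choice of background model changes an apparent motif score can be shown with a small, generic sketch. It is not the LR or ALR implementation linked above. It assumes a position weight matrix scored against a zero-order (single-nucleotide) background by a log-likelihood ratio, and every probability, sequence and function name in it is hypothetical.

```python
import math

def log_likelihood_ratio(window, pwm, background):
    """log P(window | motif) - log P(window | background) for one aligned window."""
    return sum(
        math.log(pwm[position][base]) - math.log(background[base])
        for position, base in enumerate(window)
    )

def best_window(sequence, pwm, background):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [
        (log_likelihood_ratio(sequence[start:start + width], pwm, background), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Hypothetical 3-bp motif that prefers the site TAT.
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Two candidate null models: uniform bases versus an AT-rich composition.
uniform_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich_background = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

score_uniform = log_likelihood_ratio("TAT", pwm, uniform_background)
score_at_rich = log_likelihood_ratio("TAT", pwm, at_rich_background)
top_score, top_start = best_window("GGCTATCG", pwm, uniform_background)
print(score_uniform, score_at_rich, top_start)
```

In this toy case the same site scores lower once the null model reflects an AT-rich sequence composition, which loosely mirrors the concern that an overly simple background makes motifs look more significant than they are.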
Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring ‘N’ (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification.
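The generalization from pairs to N genes can be sketched as follows. This is an illustration under stated assumptions, not the published TSN algorithm: for each combination of N genes it tabulates how often every expression ordering (permutation) occurs in each class, and it scores the combination by the summed absolute difference of those frequencies. That scoring rule, the tie handling, and all names and values are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def ordering(sample, genes):
    """Genes sorted by increasing expression within one sample."""
    return tuple(sorted(genes, key=lambda g: sample[g]))

def fit_tsn(samples, labels, genes, n):
    """Search all n-gene combinations for the most class-discriminating orderings."""
    class_0, class_1 = sorted(set(labels))  # assumes exactly two classes
    best = None
    for subset in combinations(genes, n):
        counts = {class_0: Counter(), class_1: Counter()}
        for sample, label in zip(samples, labels):
            counts[label][ordering(sample, subset)] += 1
        sizes = {c: sum(counts[c].values()) for c in counts}
        seen = set(counts[class_0]) | set(counts[class_1])
        freq = {c: {o: counts[c][o] / sizes[c] for o in seen} for c in counts}
        # Assumed score: total difference in ordering frequencies between classes.
        score = sum(abs(freq[class_0][o] - freq[class_1][o]) for o in seen)
        if best is None or score > best["score"]:
            best = {"genes": subset, "score": score, "freq": freq}
    return best

def predict_tsn(model, sample):
    """Assign the class in which the sample's ordering was seen more often."""
    observed = ordering(sample, model["genes"])
    class_0, class_1 = sorted(model["freq"])
    f0 = model["freq"][class_0].get(observed, 0.0)
    f1 = model["freq"][class_1].get(observed, 0.0)
    # Ties, including orderings never seen in training, fall to class_1 here.
    return class_0 if f0 > f1 else class_1

# Hypothetical data: class A shows g1 < g2 < g3, class B shows g3 < g1 < g2.
tsn_samples = [
    {"g1": 1, "g2": 2, "g3": 3},
    {"g1": 2, "g2": 4, "g3": 8},
    {"g1": 5, "g2": 6, "g3": 1},
    {"g1": 3, "g2": 9, "g3": 2},
]
tsn_labels = ["A", "A", "B", "B"]
gene_names = ["g1", "g2", "g3"]

model_n3 = fit_tsn(tsn_samples, tsn_labels, gene_names, 3)
model_n2 = fit_tsn(tsn_samples, tsn_labels, gene_names, 2)
print(model_n2["genes"], predict_tsn(model_n3, {"g1": 10, "g2": 20, "g3": 30}))
```

With N = 2 this reduces to a pair-based rule, and larger N enlarges both the number of gene combinations searched and the number of orderings per combination, which is the trade-off the classifier size controls.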
TSN was tested on nine cancer datasets, showing statistically significant differences in classification accuracy between different classifier sizes (choices of N). TSN also performed competitively against a wide variety of different classification methods, including artificial neural networks, classification trees, discriminant analysis, k-nearest neighbor, naïve Bayes, and support vector machines, when tested on the Microarray Quality Control II datasets. Furthermore, TSN exhibits low levels of overfitting on training data compared to other methods, giving confidence that results obtained during cross validation will be more generally applicable to external validation sets.
TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracies when fewer measured features are available.
Classification; Top-scoring pair; Relative expression; Cross validation; Support vector machine; Graphics processing unit; Microarray