Elucidating gene regulatory networks (GRNs) from large-scale experimental data remains a central challenge in systems biology. Recently, consensus-driven approaches that combine different algorithms have emerged as a promising strategy for inferring accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, which integrates multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) simply combining many algorithms does not always improve performance enough to justify the cost of building a consensus and (ii) TopkNet, when integrating only high-performance algorithms, provides significant performance improvement over the best individual algorithms and over community prediction. These results suggest that a priori determination of high-performance algorithms is key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful for identifying potentially optimal algorithms for the reconstruction of unknown regulatory networks: if the expression data associated with a known regulatory network are similar to the expression data associated with an unknown regulatory network, the algorithms found to be optimal for the known network can be repurposed to infer the unknown one. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, when the similarity between two expression datasets is high, TopkNet integrating the algorithms that are optimal for the known dataset also performs well on the unknown dataset. The consensus framework TopkNet, together with the similarity measure proposed in this study, provides a powerful strategy towards harnessing the wisdom of crowds in the reconstruction of unknown regulatory networks.
Elucidating gene regulatory networks is crucial for understanding disease mechanisms at the system level. A large number of algorithms have been developed to infer gene regulatory networks from gene-expression datasets. Recall the success of IBM's Watson in the “Jeopardy!” quiz show: its critical features were the use of a very large number of heterogeneous algorithms to generate candidate hypotheses and a mechanism to select one of them as the answer. We took a similar approach, “TopkNet”, to test whether the “wisdom of crowds” can be applied to network reconstruction. We discovered that the “wisdom of crowds” is a powerful approach: integrating the algorithms that are optimal for a given dataset can achieve better results than the best individual algorithm. However, such an analysis begs the question “How does one choose optimal algorithms for a given dataset?” We found that similarity among gene-expression datasets is the key to selecting optimal algorithms, i.e., if dataset A, for which the optimal algorithms are known, is similar to dataset B, the optimal algorithms for dataset A may also be optimal for dataset B. Thus, our “TopkNet”, together with a similarity measure among datasets, provides a powerful strategy towards harnessing the “wisdom of crowds” in high-quality reconstruction of gene regulatory networks.
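The consensus step can be illustrated with a minimal, hypothetical sketch (not the published TopkNet implementation): each algorithm scores every candidate edge, the scores are converted to ranks, and the ranks of the k algorithms with the best prior performance are averaged to produce the consensus ordering. All function and variable names below are illustrative.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (regulator, target)

def to_ranks(scores: Dict[Edge, float]) -> Dict[Edge, int]:
    """Convert edge confidence scores into ranks (1 = most confident edge)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}

def topk_consensus(per_algorithm_scores: Dict[str, Dict[Edge, float]],
                   algorithm_performance: Dict[str, float],
                   k: int = 3) -> List[Edge]:
    """Average edge ranks over the k algorithms with the best prior performance."""
    top_algorithms = sorted(algorithm_performance,
                            key=algorithm_performance.get, reverse=True)[:k]
    rank_tables = [to_ranks(per_algorithm_scores[a]) for a in top_algorithms]
    edges = set().union(*rank_tables)
    worst = len(edges)  # edges missing from an algorithm are treated as worst-ranked
    avg_rank = {e: sum(t.get(e, worst) for t in rank_tables) / len(rank_tables)
                for e in edges}
    return sorted(edges, key=avg_rank.get)  # highest-confidence consensus edges first
```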
The availability of large-scale protein-protein interaction networks for numerous organisms provides an opportunity to comprehensively analyze whether simple properties of proteins are predictive of the roles they play in the functional organization of the cell. We begin by re-examining an influential but controversial characterization of the dynamic modularity of the S. cerevisiae interactome that incorporated gene expression data into network analysis. We analyze the protein-protein interaction networks of five organisms, S. cerevisiae, H. sapiens, D. melanogaster, A. thaliana, and E. coli, and confirm significant and consistent functional and structural differences between hub proteins that are co-expressed with their interacting partners and those that are not, and support the view that the former tend to be intramodular whereas the latter tend to be intermodular. However, we also demonstrate that in each of these organisms, simple topological measures are significantly correlated with the average co-expression of a hub with its partners, independent of any classification, and therefore also reflect protein intra- and inter-modularity. Further, cross-interactomic analysis demonstrates that these simple topological characteristics of hub proteins tend to be conserved across organisms. Overall, we give evidence that purely topological features of static interaction networks reflect aspects of the dynamics and modularity of interactomes as well as previous measures incorporating expression data, and are a powerful means for understanding the dynamic roles of hubs in interactomes.
A better understanding of protein interaction networks would be a great aid in furthering our knowledge of the molecular biology of the cell. Towards this end, large-scale protein-protein physical interaction data have been determined for organisms across the evolutionary spectrum. However, the resulting networks give a static view of interactomes, and our knowledge about protein interactions is rarely time or context specific. A previous prominent but controversial attempt to characterize the dynamic modularity of the interactome was based on integrating physical interaction data with gene activity measurements from transcript expression data. This analysis distinguished between proteins that are co-expressed with their interacting partners and those that are not, and argued that the former are intramodular and the latter are intermodular. By analyzing the interactomes of five organisms, we largely confirm the biological significance of this characterization through a variety of statistical tests and computational experiments. Surprisingly, however, we find that similar results can be obtained using just network information without additionally integrating expression data, suggesting that purely topological characteristics of interaction networks strongly reflect certain aspects of the dynamics and modularity of interactomes.
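The quantity at the heart of this distinction is the average co-expression of a hub with its interaction partners. A minimal sketch of that computation is shown below; the data structures and names are illustrative, and the published analyses add further normalisation and significance testing.

```python
import numpy as np

def average_coexpression(hub: str,
                         partners: list[str],
                         expression: dict[str, np.ndarray]) -> float:
    """Mean Pearson correlation between a hub's expression profile and the
    profiles of its interaction partners.  High values point to intramodular
    hubs, low values to intermodular hubs."""
    hub_profile = expression[hub]
    correlations = [np.corrcoef(hub_profile, expression[p])[0, 1]
                    for p in partners if p in expression]
    return float(np.mean(correlations)) if correlations else float("nan")
```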
Biological regulatory systems face a fundamental tradeoff: they must be effective but at the same time also economical. For example, regulatory systems that are designed to repair damage must be effective in reducing damage, but economical in not making too many repair proteins because making excessive proteins carries a fitness cost to the cell, called protein burden. In order to see how biological systems compromise between the two tasks of effectiveness and economy, we applied an approach from economics and engineering called Pareto optimality. This approach allows calculating the best-compromise systems that optimally combine the two tasks. We used a simple and general model for regulation, known as integral feedback, and showed that best-compromise systems have particular combinations of biochemical parameters that control the response rate and basal level. We find that the optimal systems fall on a curve in parameter space. Due to this feature, even if one is able to measure only a small fraction of the system's parameters, one can infer the rest. We applied this approach to estimate parameters in three biological systems: response to heat shock and response to DNA damage in bacteria, and calcium homeostasis in mammals.
Many systems in the cell work to keep homeostasis, or balance. For example, damage repair systems make special repair proteins to resolve damage. These systems typically have many biochemical parameters such as biochemical rate constants, and it is not clear how much of the huge parameter space is filled by actual biological systems. We examined how natural selection acts on these systems when there are two important tasks: effectiveness – rapidly repairing damage, and economy – avoiding excessive production of repair proteins. We find that this multi-task optimization situation leads to natural selection of circuits that lie on a curve in parameter space. Thus, most of the parameter space is empty. Estimating only a few parameters of the circuit is enough to predict the rest. This approach allowed us to estimate parameters for bacterial heat shock and DNA repair systems, and for a mammalian hormone system responsible for calcium homeostasis.
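As a point of reference, the textbook form of integral feedback that this class of models builds on can be written as follows. The notation is generic and chosen here for illustration; the authors' specific damage-repair model contains additional structure.

```latex
% Minimal textbook integral-feedback sketch (illustrative notation).
% x(t): regulated quantity (e.g. damage), y(t): repair-protein level,
% d(t): disturbance/damage input.
\frac{dx}{dt} = d(t) - \theta\, x\, y , \qquad
\frac{dy}{dt} = k\,\big(x - x_0\big).
% At steady state the second equation forces x back to the set point x_0
% regardless of d: k sets the response rate and x_0 the basal level, the two
% kinds of parameters whose best-compromise combinations trace a Pareto curve.
```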
Speculative statements communicating experimental findings are frequently found in scientific articles, and their purpose is to provide an impetus for further investigations into the given topic. Automated recognition of speculative statements in scientific text has gained interest in recent years, as systematic analysis of such statements could transform speculative thoughts into testable hypotheses. We describe here a pattern matching approach for the detection of speculative statements in scientific text that uses a dictionary of speculative patterns to classify sentences as hypothetical. To demonstrate the practical utility of our approach, we applied it to the domain of Alzheimer's disease and showed that our automated approach captures a wide spectrum of scientific speculations on Alzheimer's disease. Subsequent exploration of the derived hypothetical knowledge leads to the generation of a coherent overview of emerging knowledge niches and can thus add value to ongoing research activities.
Published speculations about possible molecular mechanisms underlying normal and diseased biological processes provide valuable input for the generation of new scientific hypotheses. However, a systematic gathering of all scientific speculation that exists in a given context is a non-trivial task and, if done manually, is laborious and time-consuming. The “HypothesisFinder” approach outlined here provides a possible solution for making scientific speculation gathering more tractable. Using a dictionary of speculative patterns, HypothesisFinder detects, collates and analyzes published speculative statements for a specific context. This can be extremely useful, particularly in reference to complex and poorly understood diseases like Alzheimer's disease. For example, by formulating a series of reasonable speculations on causes and effects, we could gain new insights into the directions of Alzheimer's disease etiology and progression. An effective literature search with the help of HypothesisFinder can support the process of knowledge discovery and hypothesis generation, which has the potential to add value to ongoing research activities.
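At its core this is dictionary-based pattern matching over sentences. The sketch below illustrates the idea with a handful of made-up speculative cue patterns; the actual HypothesisFinder dictionary is far richer, and the example sentences are invented for illustration.

```python
import re

# Illustrative speculative cue patterns (not the published dictionary).
SPECULATIVE_PATTERNS = [
    r"\bmay (?:be|play|contribute|lead)\b",
    r"\bmight\b",
    r"\bcould (?:be|play|explain)\b",
    r"\bwe (?:hypothesi[sz]e|speculate|propose) that\b",
    r"\bit is (?:possible|tempting to speculate) that\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in SPECULATIVE_PATTERNS]

def is_speculative(sentence: str) -> bool:
    """Classify a sentence as speculative if any dictionary pattern matches."""
    return any(p.search(sentence) for p in COMPILED)

sentences = [
    "Amyloid-beta oligomers might trigger early synaptic dysfunction.",
    "ApoE4 is a well-established genetic risk factor for Alzheimer's disease.",
]
print([s for s in sentences if is_speculative(s)])  # keeps only the first sentence
```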
Current drug discovery is impossible without sophisticated modeling and computation. In this review we outline previous advances in computational biology and, by tracing the steps involved in pharmaceutical development, explore a range of novel, high-value opportunities for computational innovation in modeling the biological process of disease and the social process of drug discovery. These opportunities include text mining for new drug leads, modeling molecular pathways and predicting the efficacy of drug cocktails, analyzing genetic overlap between diseases and predicting alternative drug use. Computation can also be used to model research teams and innovative regions and to estimate the value of academy–industry links for scientific and human benefit. Attention to these opportunities promises punctuated advances and will complement the well-established computational work on which drug discovery currently relies.
It is acknowledged that some obesity trajectories are set early in life, and that rapid weight gain in infancy is a risk factor for later development of obesity. Identifying modifiable factors associated with early rapid weight gain is a prerequisite for curtailing the growing worldwide obesity epidemic. Recently, much attention has been given to findings indicating that gut microbiota may play a role in obesity development. We aim to identify how the development of the early gut microbiota is associated with expected infant growth. We developed a novel procedure that allows for the identification of longitudinal gut microbiota patterns (corresponding to the developing gut ecosystem) that are associated with an outcome of interest, while appropriately controlling the false discovery rate. Our method identified developmental pathways of Staphylococcus species and Escherichia coli that were associated with expected growth, and traditional methods indicated that the detection of Bacteroides species at day 30 was associated with growth. Our method should have wide future applicability for studying gut microbiota, and is particularly important for translational considerations, as it is critical to understand the timing of microbiome transitions prior to attempting to manipulate gut microbiota in early life.
Some obesity trajectories are set early in life, with rapid weight gain being a risk factor for later development of obesity. Recently, much attention has been given to findings indicating that gut microbiota may play a role in obesity development. The existence of time-dependent exposure windows, which rely on stimuli from the gut to initiate healthy development, gives the evolution of the early-life gut microbiota a critical role in human health. We identified children that followed their expected growth trajectories at six months of life, and those that had deviated. We then developed a novel statistical approach that allowed the identification of longitudinal gut microbiota patterns (e.g. a particular species was detected at days 4, 10, and 30 and not detected at day 120) that were associated with expected growth, while appropriately controlling the false discovery rate. We further identified when a deviation from the proposed longitudinal gut microbiota patterns would result in an abnormal growth outcome (either rapid or decreased growth at six months of life). We found developmental pathways of Staphylococcus species and Escherichia coli that were associated with expected growth, as well as indications that the detection of Bacteroides species at day 30 was associated with growth.
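The statistical idea can be sketched under simplifying assumptions: encode each subject's longitudinal presence/absence calls for a taxon as a binary pattern, test each pattern for association with expected growth, and control the false discovery rate across the tested patterns. The Fisher-test/Benjamini-Hochberg combination below is a stand-in for the paper's own procedure, and all names and thresholds are illustrative.

```python
import numpy as np
from itertools import product
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# `detection`: subjects x days boolean matrix of presence/absence calls for one taxon;
# `expected_growth`: boolean vector (True = child followed the expected trajectory).
DAYS = [4, 10, 30, 120]

def associated_patterns(detection: np.ndarray, expected_growth: np.ndarray, fdr: float = 0.05):
    """Test every longitudinal presence/absence pattern for association with
    expected growth and keep those passing Benjamini-Hochberg FDR control."""
    pvals, tested = [], []
    for pattern in product([0, 1], repeat=len(DAYS)):
        match = np.all(detection == np.array(pattern, dtype=bool), axis=1)
        if match.sum() < 3:                      # too few subjects follow this pattern
            continue
        table = [[int(( match &  expected_growth).sum()),
                  int(( match & ~expected_growth).sum())],
                 [int((~match &  expected_growth).sum()),
                  int((~match & ~expected_growth).sum())]]
        pvals.append(fisher_exact(table)[1])
        tested.append(pattern)
    reject, qvals, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
    return [(p, q) for p, q, r in zip(tested, qvals, reject) if r]
```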
The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO
Keywords: ontology; knowledge representation; probabilistic reasoning
Several recent studies have examined different aspects of mammalian higher order chromatin structure (replication timing, lamina association, and Hi-C inter-locus interactions) and have suggested that most of these features of genome organisation are conserved over evolution. However, the extent of evolutionary divergence in higher order structure has not been rigorously measured across the mammalian genome, and until now little has been known about the characteristics of any divergent loci present. Here, we generate a dataset combining multiple measurements of chromatin structure and organisation over many embryonic cell types for both human and mouse that, for the first time, allows a comprehensive assessment of the extent of structural divergence between mammalian genomes. Comparison of orthologous regions confirms that all measurable facets of higher order structure are conserved between human and mouse, across the vast majority of the detectably orthologous genome. This broad similarity is observed in spite of many loci possessing cell type specific structures. However, we also identify hundreds of regions (from 100 Kb to 2.7 Mb in size) showing consistent evidence of divergence between these species, constituting at least 10% of the orthologous mammalian genome and encompassing many hundreds of human and mouse genes. These regions show unusual shifts in human GC content, are unevenly distributed across both genomes, and are enriched in human subtelomeric regions. Divergent regions are also relatively enriched for genes showing divergent expression patterns between human and mouse ES cells, implying that structural divergence in these regions contributes to divergent regulation. Particular divergent loci are strikingly enriched in genes implicated in vertebrate development, suggesting important roles for structural divergence in the evolution of mammalian developmental programmes. These data suggest that, though relatively rare in the mammalian genome, divergence in higher order chromatin structure has played important roles during evolution.
The mammalian genome is organised into large multi-megabase domains defined by their physical structure, or higher order chromatin structure. Although these structures are believed to be well conserved between species, there have been few studies attempting to quantify such conservation, or identify divergent structures. We find that regions showing clear evidence of divergence in higher order chromatin structure encompass at least 10% of the mammalian genome, and include many hundreds of genes whose regulation may have been affected. At least some of these genes have been directly implicated in evolutionary innovations to vertebrate developmental programmes, so divergent regions may have been disproportionately important during evolution. In addition, we show that divergent regions occur in large stretches of more than 2 Mb in the human genome and are enriched towards telomeres at the ends of human chromosomes. This may reflect shifts in the nuclear organisation and regulatory functions of chromatin domains between human and mouse.
The decreasing cost of sequencing is leading to a growing repertoire of personal genomes. However, we are lagging behind in understanding the functional consequences of the millions of variants obtained from sequencing. Global system-wide effects of variants in coding genes are particularly poorly understood. It is known that while variants in some genes can lead to diseases, complete disruption of other genes, called ‘loss-of-function tolerant’, is possible with no obvious effect. Here, we build a systems-based classifier to quantitatively estimate the global perturbation caused by deleterious mutations in each gene. We first survey the degree to which gene centrality in various individual networks and a unified ‘Multinet’ correlates with the tolerance to loss-of-function mutations and evolutionary conservation. We find that functionally significant and highly conserved genes tend to be more central in physical protein-protein and regulatory networks. However, this is not the case for metabolic pathways, where the highly central genes have more duplicated copies and are more tolerant to loss-of-function mutations. Integration of three-dimensional protein structures reveals that the correlation with centrality in the protein-protein interaction network is also seen in terms of the number of interaction interfaces used. Finally, combining all the network and evolutionary properties allows us to build a classifier distinguishing functionally essential and loss-of-function tolerant genes with higher accuracy (AUC = 0.91) than any individual property. Application of the classifier to the whole genome shows its strong potential for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
The number of personal genomes sequenced has grown rapidly over the last few years and is likely to grow further. In order to use the DNA sequence variants amongst individuals for personalized medicine, we need to understand the functional impact of these variants. Deleterious variants in genes can have a wide spectrum of global effects, ranging from fatal for essential genes to no obvious damaging effect for loss-of-function tolerant genes. The global effect of a gene mutation is largely governed by the diverse biological networks in which the gene participates. Since genes participate in many networks, no single network captures the global picture of gene interactions. Here we integrate the diverse modes of gene interactions (regulatory, genetic, phosphorylation, signaling, metabolic and physical protein-protein interactions) to create a unified biological network. We then exploit the unique properties of loss-of-function tolerant and essential genes in this unified network to build a computational model that can predict global perturbation caused by deleterious mutations in all genes. Our model can distinguish between these two gene sets with high accuracy and we further show that it can be used for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
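The core of such a classifier is straightforward: represent each gene by its centrality in one or more networks and train a supervised model to separate essential from loss-of-function-tolerant genes. The sketch below uses only a few generic topological features and a random forest as an illustration; the published classifier combines many more network, structural, and evolutionary properties, and all names here are assumptions.

```python
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def centrality_features(graph: nx.Graph, genes: list[str]) -> np.ndarray:
    """Simple topological features per gene: degree, betweenness, clustering."""
    degree = dict(graph.degree())
    betweenness = nx.betweenness_centrality(graph)
    clustering = nx.clustering(graph)
    return np.array([[degree.get(g, 0), betweenness.get(g, 0.0), clustering.get(g, 0.0)]
                     for g in genes])

def essentiality_auc(graph: nx.Graph, genes: list[str], is_essential: np.ndarray) -> float:
    """Cross-validated AUC of a classifier separating essential from
    loss-of-function-tolerant genes using network centrality features."""
    X = centrality_features(graph, genes)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return float(cross_val_score(clf, X, is_essential, cv=5, scoring="roc_auc").mean())
```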
To understand the complex relationship governing transcript abundance and the level of the encoded protein, we integrate genome-wide experimental data of ribosomal density on mRNAs with a novel stochastic model describing ribosome traffic dynamics during translation elongation. This analysis reveals that codon arrangement, rather than simply codon bias, has a key role in determining translational efficiency. It also reveals that translation output is governed by both initiation efficiency and elongation dynamics. By integrating genome-wide experimental datasets with simulations of ribosome traffic on all Saccharomyces cerevisiae ORFs, we estimate mRNA-specific translation initiation rates across the entire transcriptome for the first time. Our analysis identifies different classes of mRNAs characterised by their initiation rates, their ribosome traffic dynamics, and their response to ribosome availability. Strikingly, this classification based on translational dynamics maps onto key gene ontological classifications, revealing evolutionary optimisation of translation responses to be strongly influenced by gene function.
Gene expression regulation is central to all living systems. Here we introduce a new framework and methodology to study the last stage of protein production in cells, where the genetic information encoded in the mRNAs is translated from the language of nucleotides into functional proteins. The process, on each mRNA, is carried out concurrently by several ribosomes; like cars on a small countryside road, they cannot overtake each other, and can form queues. By integrating experimental data with genome-wide simulations of our model, we analyse ribosome traffic across the entire Saccharomyces cerevisiae genome, and for the first time estimate mRNA-specific translation initiation rates for each transcript. Crucially, we identify different classes of mRNAs characterised by different ribosome traffic dynamics. Remarkably, this classification based on translational dynamics, and the evaluation of mRNA-specific initiation rates, map onto key gene ontological classifications, revealing evolutionary optimisation of translation responses to be strongly influenced by gene function.
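The "cars on a countryside road" picture corresponds to an exclusion-process model of ribosome traffic. Below is a deliberately simplified, discrete-time sketch (single-codon ribosome footprint, illustrative parameters); the model analysed in the paper is substantially more detailed.

```python
import numpy as np

def simulate_ribosome_traffic(hop_probs, init_prob, sweeps=50000, seed=0):
    """Discrete-time sketch of ribosome traffic on one mRNA (a TASEP-like
    exclusion process): a ribosome enters codon 1 with probability `init_prob`
    per sweep, moves from codon i to i+1 with codon-specific probability
    hop_probs[i] if the next codon is free, and terminates at the last codon.
    Returns the protein production rate (terminations per sweep)."""
    rng = np.random.default_rng(seed)
    L = len(hop_probs)
    occupied = np.zeros(L, dtype=bool)
    proteins = 0
    for _ in range(sweeps):
        # update from the 3' end so each ribosome moves at most one codon per sweep
        for i in range(L - 1, -1, -1):
            if occupied[i] and rng.random() < hop_probs[i]:
                if i == L - 1:
                    occupied[i] = False
                    proteins += 1          # a ribosome terminates and releases a protein
                elif not occupied[i + 1]:
                    occupied[i], occupied[i + 1] = False, True
        if not occupied[0] and rng.random() < init_prob:
            occupied[0] = True             # initiation at the 5' end
    return proteins / sweeps
```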
Our understanding of most biological systems is in its infancy. Learning their structure and intricacies is fraught with challenges, and often side-stepped in favour of studying the function of different gene products in isolation from their physiological context. Constructing and inferring global mathematical models from experimental data is, however, central to systems biology. Different experimental setups provide different insights into such systems. Here we show how we can combine concepts from Bayesian inference and information theory in order to identify experiments that maximize the information content of the resulting data. This approach allows us to incorporate preliminary information; it is global and not constrained to some local neighbourhood in parameter space and it readily yields information on parameter robustness and confidence. Here we develop the theoretical framework and apply it to a range of exemplary problems that highlight how we can improve experimental investigations into the structure and dynamics of biological systems and their behavior.
For most biological signalling and regulatory systems we still lack reliable mechanistic models. And where such models exist, e.g. in the form of differential equations, we typically have only rough estimates for the parameters that characterize the biochemical reactions. In order to improve our knowledge of such systems we require better estimates for these parameters and here we show how judicious choice of experiments, based on a combination of simulations and information theoretical analysis, can help us. Our approach builds on the available, frequently rudimentary information, and identifies which experimental set-up provides most additional information about all the parameters, or individual parameters. We will also consider the related but subtly different problem of which experiments need to be performed in order to decrease the uncertainty about the behaviour of the system under altered conditions. We develop the theoretical framework in the necessary detail before illustrating its use and applying it to the repressilator model, the regulation of Hes1 and signal transduction in the Akt pathway.
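The criterion used to rank candidate experiments is, in its generic textbook form, the expected information gain about the parameters, i.e. the mutual information between parameters and prospective data. The notation below is generic rather than the paper's own.

```latex
% Expected information gain of a candidate experiment/design d about parameters \theta:
U(d) \;=\; \mathbb{E}_{y \sim p(y \mid d)}
      \Big[ D_{\mathrm{KL}}\!\big( p(\theta \mid y, d) \,\|\, p(\theta) \big) \Big]
    \;=\; \iint p(y,\theta \mid d)\,
      \log \frac{p(\theta \mid y, d)}{p(\theta)} \, d\theta \, dy .
% The optimal experiment maximizes U(d) over the available designs d.
```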
Interactions of proteins regulate signaling, catalysis, gene expression and many other cellular functions. Therefore, characterizing the entire human interactome is a key effort in current proteomics research. This challenge is complicated by the dynamic nature of protein-protein interactions (PPIs), which are conditional on the cellular context: both interacting proteins must be expressed in the same cell and localized in the same organelle to meet. Additionally, interactions underlie the delicate control of signaling pathways, e.g. by post-translational modifications of the protein partners; hence, many diseases are caused by the perturbation of these mechanisms. Despite the high degree of cell-state specificity of PPIs, many interactions are measured under artificial conditions (e.g. yeast cells transfected with human genes in yeast two-hybrid assays), or, even when interactions are detected in a physiological context, this information is missing from the common PPI databases. To overcome these problems, we developed a method that assigns context information to PPIs, inferred from various attributes of the interacting proteins: gene expression, functional and disease annotations, and inferred pathways. We demonstrate that context consistency correlates with the experimental reliability of PPIs, which allows us to generate high-confidence tissue- and function-specific subnetworks. We illustrate how these context-filtered networks are enriched in bona fide pathways and disease proteins, demonstrating the ability of context filters to highlight meaningful interactions with respect to various biological questions. We use this approach to study the lung-specific pathways used by the influenza virus, pointing to IRAK1, BHLHE40 and TOLLIP as potential regulators of influenza virus pathogenicity, and to study the signaling pathways that play a role in Alzheimer's disease, identifying a pathway involving the altered phosphorylation of the Tau protein. Finally, we provide the annotated human PPI network via a web frontend that allows the construction of context-specific networks in several ways.
Protein-protein interactions (PPIs) participate in virtually all biological processes. However, the PPI map is not static: the pairs of proteins that interact depend on the cell type, the subcellular localization, and modifications of the participating proteins, among many other factors. Therefore, it is important to understand the specific conditions under which a PPI happens. Unfortunately, experimental methods often do not provide this information or, even worse, measure PPIs under artificial conditions not found in biological systems. We developed a method to infer this missing information from properties of the interacting proteins, such as in which cell types the proteins are found, which functions they fulfill and whether they are known to play a role in disease. We show that PPIs for which we can infer the conditions under which they happen have a higher experimental reliability. Also, our inference agrees well with known pathways and disease proteins. Since diseases usually affect specific cell types, we study PPI networks of influenza proteins in lung tissues and of Alzheimer's disease proteins in neural tissues. In both cases, we can highlight interesting interactions potentially playing a role in disease progression.
New microbial genomes are sequenced at a high pace, allowing insight into the genetics of not only cultured microbes, but a wide range of metagenomic collections such as the human microbiome. To understand the deluge of genomic data we face, computational approaches for gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs—homologs separated by a speciation and a duplication event, respectively—provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes for which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25 predictions, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time. Our predictions for 998 prokaryotic genomes include ∼400000 specific annotations with the estimated Precision of 90%, ∼19000 of which are highly specific—e.g. “penicillin binding,” “tRNA aminoacylation for protein translation,” or “pathogenesis”—and are freely available at http://gorbi.irb.hr/.
While both the number and the diversity of sequenced prokaryotic genomes grow rapidly, the number of specific assignments of gene functions in the databases remains low and skewed toward the model prokaryote Escherichia coli. To aid in understanding the full set of newly sequenced genes, we created a computational model for assignment of function to prokaryotic genomes. The result is an innovative framework for orthology and paralogy-aware phyletic profiling that provides a large number of computational annotations with high predictive accuracy in train/test evaluations. Our predictions include annotations for 1.3 million genes with the estimated Precision of 90%; these, and many more predictions for 998 prokaryotic genomes are freely available at http://gorbi.irb.hr/. More importantly, we show a proof of principle that our functional annotation model can be used to generate new biological hypotheses: we performed experiments on 38 E. coli knockout mutants and showed that our annotation model provides realistic estimates of predictive accuracy. With this, our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time.
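Phyletic profiling itself is simple to state: genes (or gene families) with similar presence/absence patterns across genomes tend to share function, so annotations can be transferred between genes with similar profiles. The sketch below shows this basic idea with illustrative data structures; the published model additionally separates orthologs from paralogs and calibrates per-prediction confidence.

```python
import numpy as np

def transfer_annotations(profiles: dict[str, np.ndarray],
                         annotations: dict[str, set[str]],
                         query_gene: str,
                         n_neighbours: int = 5) -> set[str]:
    """Phyletic-profiling sketch: each gene is described by a binary
    presence/absence vector across genomes; GO terms are transferred to the
    query gene from the genes with the most similar profiles."""
    query = profiles[query_gene]

    def jaccard(a: np.ndarray, b: np.ndarray) -> float:
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum() / union) if union else 0.0

    neighbours = sorted((g for g in profiles if g != query_gene),
                        key=lambda g: jaccard(query, profiles[g]),
                        reverse=True)[:n_neighbours]
    transferred = set()
    for g in neighbours:
        transferred |= annotations.get(g, set())
    return transferred
```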
In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental “features” of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. This situation has prompted the scientific community to rely on crowd-sourcing through international competitions, such as Assemblathons or GAGE, with the intention of identifying the best assembler(s) and their features. Somewhat circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality, fully assembled reference genome sequence. Worse still, reference-guided evaluations are often difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, built around the FRCurve approach, which is capable of evaluating de novo assemblies from the read layouts even when no reference exists. We extend the FRCurve approach to cases where layout information may have been obscured, as is true in many deBruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers – thus identifying higher-quality members of this group, their inter-relations, as well as their sensitivity to carefully selected features, with or without the support of a reference sequence or a layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.
A lack of mature domain knowledge and well-established guidelines makes the medical diagnosis of skeletal dysplasias (a group of rare genetic disorders) a very complex process. Machine learning techniques can facilitate objective interpretation of medical observations for the purposes of decision support. However, building decision support models using such techniques is highly problematic in the context of rare genetic disorders, because it depends on access to mature domain knowledge. This paper describes an approach for developing a decision support model in medical domains that are underpinned by relatively sparse knowledge bases. We propose a solution that combines association rule mining with the Dempster-Shafer theory (DST) to compute probabilistic associations between sets of clinical features and disorders, which can then serve as support for medical decision making (e.g., diagnosis). We show, via experimental results, that our approach is able to provide meaningful outcomes even on small datasets with sparse distributions, in addition to outperforming other machine learning techniques and performing slightly better than an initial diagnosis by a clinician.
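The Dempster-Shafer step combines evidence from individual association rules into a single belief assignment over candidate disorders. Below is a minimal sketch of Dempster's rule of combination; the mass values and disorder names are invented for illustration, whereas the paper derives masses from mined association rules.

```python
from itertools import product

def dempster_combine(m1: dict[frozenset, float], m2: dict[frozenset, float]) -> dict[frozenset, float]:
    """Dempster's rule of combination for two basic mass assignments over sets
    of candidate disorders: masses of intersecting focal elements are multiplied
    and the result is normalised by (1 - conflict)."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# Illustrative example: two rules pointing towards overlapping sets of disorders.
theta = frozenset({"achondroplasia", "hypochondroplasia", "other"})
m_rule1 = {frozenset({"achondroplasia", "hypochondroplasia"}): 0.7, theta: 0.3}
m_rule2 = {frozenset({"achondroplasia"}): 0.6, theta: 0.4}
print(dempster_combine(m_rule1, m_rule2))
```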
Accumulating experimental evidence suggests that the gene regulatory networks of living organisms operate in the critical phase, namely, at the transition between ordered and chaotic dynamics. Such critical dynamics of the network permits the coexistence of robustness and flexibility which are necessary to ensure homeostatic stability (of a given phenotype) while allowing for switching between multiple phenotypes (network states) as occurs in development and in response to environmental change. However, the mechanisms through which genetic networks evolve such critical behavior have remained elusive. Here we present an evolutionary model in which criticality naturally emerges from the need to balance between the two essential components of evolvability: phenotype conservation and phenotype innovation under mutations. We simulated the Darwinian evolution of random Boolean networks that mutate gene regulatory interactions and grow by gene duplication. The mutating networks were subjected to selection for networks that both (i) preserve all the already acquired phenotypes (dynamical attractor states) and (ii) generate new ones. Our results show that this interplay between extending the phenotypic landscape (innovation) while conserving the existing phenotypes (conservation) suffices to cause the evolution of all the networks in a population towards criticality. Furthermore, the networks produced by this evolutionary process exhibit structures with hubs (global regulators) similar to the observed topology of real gene regulatory networks. Thus, dynamical criticality and certain elementary topological properties of gene regulatory networks can emerge as a byproduct of the evolvability of the phenotypic landscape.
Dynamically critical systems are those which operate at the border of a phase transition between two behavioral regimes often present in complex systems: order and disorder. Critical systems exhibit remarkable properties such as fast information processing, collective response to perturbations or the ability to integrate a wide range of external stimuli without saturation. Recent evidence indicates that the genetic networks of living cells are dynamically critical. This has far reaching consequences, for it is at criticality that living organisms can tolerate a wide range of external fluctuations without changing the functionality of their phenotypes. Therefore, it is necessary to know how genetic criticality emerged through evolution. Here we show that dynamical criticality naturally emerges from the delicate balance between two fundamental forces of natural selection that make organisms evolve: (i) the existing phenotypes must be resilient to random mutations, and (ii) new phenotypes must emerge for the organisms to adapt to new environmental challenges. The joint effect of these two forces, which are essential for evolvability, is sufficient in our computational models to generate populations of genetic networks operating at criticality. Thus, natural selection acting as a tinkerer of evolvable systems naturally generates critical dynamics.
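Criticality in Boolean network models is commonly quantified by a Derrida-style sensitivity: flip one gene in a random state and measure how far the perturbation spreads after one update, with values near 1 indicating critical dynamics. The sketch below estimates this quantity for a random Boolean network; it illustrates the measurement of criticality, not the evolutionary simulation itself, and all parameters are illustrative.

```python
import numpy as np

def average_sensitivity(n_genes=100, k_inputs=2, n_samples=2000, seed=0):
    """Estimate the sensitivity of a random Boolean network: flip one gene in a
    random state and count how many genes differ after one synchronous update
    (ordered regime < 1, critical ~ 1, chaotic > 1)."""
    rng = np.random.default_rng(seed)
    inputs = rng.integers(0, n_genes, size=(n_genes, k_inputs))       # wiring
    tables = rng.integers(0, 2, size=(n_genes, 2 ** k_inputs), dtype=np.int8)

    def update(state):
        idx = np.zeros(n_genes, dtype=int)
        for j in range(k_inputs):
            idx = idx * 2 + state[inputs[:, j]]                        # input pattern per gene
        return tables[np.arange(n_genes), idx]

    total = 0
    for _ in range(n_samples):
        s = rng.integers(0, 2, size=n_genes, dtype=np.int8)
        s_pert = s.copy()
        s_pert[rng.integers(0, n_genes)] ^= 1                          # flip one gene
        total += int((update(s) != update(s_pert)).sum())
    return total / n_samples
```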
Gene co-expression network analysis is an effective method for predicting gene functions and disease biomarkers. However, few studies have systematically identified co-expressed genes involved in the molecular origin and development of various types of tumors. In this study, we used a network mining algorithm to identify tightly connected gene co-expression networks that are frequently present in microarray datasets from 33 types of cancer derived from 16 organs/tissues. We compared the results with networks found in multiple normal tissue types and discovered 18 tightly connected frequent networks in cancers, with functions highly enriched in cancer-related activities. Most networks identified also formed physically interacting networks. In contrast, only 6 networks were found in normal tissues, and these were highly enriched for housekeeping functions. The largest cancer network contained many genes with genome stability maintenance functions. We tested 13 selected genes from this network for their involvement in genome maintenance using two cell-based assays. Among them, 10 were shown to be involved in either homology-directed DNA repair or centrosome duplication control, including the well-known cancer marker MKI67. Our results suggest that the commonly recognized characteristics of cancers are supported by highly coordinated transcriptomic activities. This study also demonstrates that the co-expression network directed approach provides a powerful tool for understanding cancer physiology, predicting new gene functions, and providing new target candidates for cancer therapeutics.
Proteins interact with each other in a network manner to precisely regulate complicated physiological functions of life. Diseases such as cancer may occur if the network regulations go wrong. In cancer research, network mining has been utilized to identify biomarkers, predict therapeutic targets, and discover new mechanisms of cancer development. Among these applications, the search for genes with similar expression patterns (co-expression) over different samples has been particularly successful. However, few network mining approaches have been systematically applied to different types of cancers to extract common cancer features. We carried out a systematic study to identify frequently co-expressed gene networks in multiple cancers and compared them with the gene networks found in multiple normal tissues. We found dramatic differences between networks from the two sources, with gene networks in cancer corresponding to specific traits of cancer. Specifically, the largest gene network in cancer contains many genes with cell cycle control and DNA stability functions. We thus predicted that a set of poorly studied genes in this network share similar functions and validated that most of these genes are involved in DNA break repair or proper cell division. To the best of our knowledge, this is the largest-scale study of its kind.
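The search for "frequent" co-expression networks can be pictured as follows: connect gene pairs whose expression profiles are strongly correlated within a dataset, then keep edges that recur across many datasets. The sketch below shows this correlation-and-recurrence idea with illustrative thresholds; the study itself uses a dedicated network-mining algorithm rather than this naive pairwise scan.

```python
import numpy as np
import networkx as nx

def frequent_coexpression_network(datasets: list[np.ndarray], gene_names: list[str],
                                  corr_cutoff: float = 0.8, min_datasets: int = 20) -> nx.Graph:
    """Link two genes if their expression profiles are highly correlated, and keep
    only edges recurring in at least `min_datasets` of the input datasets
    (each dataset is a genes x samples matrix with rows ordered as gene_names)."""
    n = len(gene_names)
    counts = np.zeros((n, n), dtype=int)
    for data in datasets:
        corr = np.corrcoef(data)
        counts += (np.abs(corr) >= corr_cutoff)
    graph = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            if counts[i, j] >= min_datasets:
                graph.add_edge(gene_names[i], gene_names[j], support=int(counts[i, j]))
    return graph
```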
HES/HEY genes encode a family of basic helix-loop-helix (bHLH) transcription factors containing both bHLH and Orange domains. HES/HEY proteins are direct targets of the Notch signaling pathway and play an essential role in developmental decisions, such as the development of the nervous system, somitogenesis, blood vessels, and the heart. Despite their important functions, the origin and evolution of the HES/HEY gene family have yet to be elucidated.
Methods and Findings
In this study, we identified genes of the HES/HEY family in representative species and performed evolutionary analysis to elucidate their origin and evolutionary history. Our results showed that HES/HEY genes exist only in metazoans and may have originated in the common ancestor of metazoans. We identified HES/HEY genes in more than 10 species representing the main lineages. Combining the bHLH and Orange domain sequences, we constructed phylogenetic trees by different methods (Bayesian, ML, NJ, and ME) and classified the HES/HEY gene family into four groups. Our results indicated that this gene family underwent three expansions, coinciding with the origins of Eumetazoa, vertebrates, and teleosts. Gene structure analysis revealed that HES/HEY genes underwent exon and/or intron loss in different species lineages. Genes of this family were duplicated in bony fishes, roughly doubling their number relative to other vertebrates. Furthermore, we studied the teleost-specific duplications in zebrafish and investigated the expression patterns of the duplicated genes in different tissues by RT-PCR. Finally, we proposed a model for the evolution of this gene family involving expansion, exon/intron loss, and motif loss.
Our study reveals the evolution of the HES/HEY gene family and the divergence in expression and function of duplicated genes, providing clues for research into Notch function in development. It also offers a model of gene family analysis that incorporates gene structure evolution and duplication.
Hepatitis C is a treatment-resistant disease affecting millions of people worldwide. The hepatitis C virus (HCV) genome is a single-stranded RNA molecule. After infection of the host cell, viral RNA is translated into a polyprotein that is cleaved by host and viral proteinases into functional viral proteins, both structural and non-structural. Cleavage of the polyprotein involves the viral NS3/4A proteinase, a proven drug target. HCV mutates as it replicates and, as a result, multiple emerging quasispecies become rapidly resistant to anti-virals, including NS3/4A inhibitors.
To circumvent drug resistance and complement the existing anti-virals, NS3/4A inhibitors that are additional to, and distinct from, the FDA-approved α-ketoamide inhibitors telaprevir and boceprevir are required. To test potential new avenues for inhibitor development, we have probed several distinct exosites of NS3/4A which are either outside of or partially overlapping with the active site groove of the proteinase. For this purpose, we employed virtual ligand screening using the 275,000-compound library of the Developmental Therapeutics Program (NCI/NIH) and the X-ray crystal structure of NS3/4A as a ligand source and a target, respectively. As a result, we identified several novel, previously uncharacterized, nanomolar-range inhibitory scaffolds, which suppressed NS3/4A activity in vitro and the replication of a sub-genomic HCV RNA replicon with a luciferase reporter in human hepatocarcinoma cells. The binding sites of these novel inhibitors do not significantly overlap with those of α-ketoamides. As a result, the most common resistance mutations, including V36M, R155K, A156T, D168A and V170A, did not considerably diminish the inhibitory potency of certain novel inhibitor scaffolds we identified.
Overall, further optimization of the in silico strategy and software platform we developed, and of the lead compounds we identified, may lead to advances in novel anti-virals.
Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (PANs), in which functional interactions between genes are estimated using a Bayesian mixture modelling approach. The major advantage of this approach over conventional hypothesis tests is that prior knowledge can be incorporated to enhance predictive power. We demonstrate, in a simulation study and on biological data, that integrating complementary information greatly improves prediction accuracy. To search for significant modules, we perform hierarchical clustering with multiscale bootstrap resampling. We demonstrate the power of the proposed methodologies in applications to Ewing's sarcoma and human adult stem cells using publicly available and custom generated data, respectively. In the former application, we identify a gene module including many confirmed and highly promising therapeutic targets. Genes in the module are also significantly overrepresented in signalling pathways that are known to be critical for proliferation of Ewing's sarcoma cells. In the latter application, we predict a functional network of chromatin factors controlling epidermal stem cell fate. Further examinations using ChIP-seq, ChIP-qPCR and RT-qPCR reveal that their genetic interactions may arise from transcriptional cross-regulation. A Bioconductor package implementing PAN is freely available online at http://bioconductor.org/packages/release/bioc/html/PANR.html.
Synthetic genetic interactions estimated from combinatorial gene perturbation screens provide systematic insights into synergistic interactions of genes in a biological process. However, this approach lacks scalability for large-scale genetic interaction profiling in metazoan organisms such as humans. We contribute to this field by proposing a more scalable and affordable approach, which takes advantage of multiple single-gene perturbation datasets to predict coherent functional modules, followed by genetic interaction investigation using combinatorial perturbations. We developed a versatile computational framework (PAN) to robustly predict functional interactions and search for significant functional modules from rich phenotyping screens of single gene perturbations under different conditions or from multiple cell lines. PAN features a Bayesian mixture model to assess the statistical significance of functional associations, the capability to incorporate prior knowledge, and a generalized approach to search for significant functional modules by multiscale bootstrap resampling. In applications to Ewing's sarcoma and human adult stem cells, we demonstrate the general applicability and prediction power of PAN on both public and custom generated screening data.
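The statistical core is a mixture model over gene-gene association scores, from which a posterior probability of functional association is read off for every pair. The sketch below uses a plain Gaussian mixture as a stand-in to illustrate the idea; PAN itself uses a Bayesian mixture modelling approach with the option of informative priors, and all names here are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def posterior_association(scores: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Fit a mixture to pairwise association scores (e.g. correlations of rich
    phenotypes across single-gene perturbations) and return, for each pair, the
    posterior probability of belonging to the most strongly associated component."""
    scores = scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(scores)
    positive = int(np.argmax(gmm.means_.ravel()))   # component with the largest mean
    return gmm.predict_proba(scores)[:, positive]
```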
Identifying human genes relevant for the processing of pain requires difficult-to-conduct and expensive large-scale clinical trials. Here, we examine a novel integrative paradigm for data-driven discovery of pain gene candidates, taking advantage of the vast amount of existing disease-related clinical literature and gene expression microarray data stored in large international repositories. First, thousands of diseases were ranked according to a disease-specific pain index (DSPI), derived from Medical Subject Heading (MeSH) annotations in MEDLINE. Second, gene expression profiles of 121 of these human diseases were obtained from public sources. Third, genes whose expression variation was significantly correlated with the DSPI across diseases were selected as candidate pain genes. Finally, selected candidate pain genes were genotyped in an independent human cohort and prospectively evaluated for significant association between variants and measures of pain sensitivity. The strongest signal was with rs4512126 (5q32, ABLIM3, P = 1.3×10−10) for the sensitivity to cold pressor pain in males, but not in females. Significant associations were also observed with rs12548828, rs7826700 and rs1075791 on 8q22.2 within NCALD (P = 1.7×10−4, 1.8×10−4, and 2.2×10−4 respectively). Our results demonstrate the utility of a novel paradigm that integrates publicly available disease-specific gene expression data with clinical data curated from MEDLINE to facilitate the discovery of pain-relevant genes. This data-derived list of pain gene candidates enables focused and efficient biological studies to validate further candidates.
The mechanisms underlying pain are incompletely understood, and are hard to study due to the subjective and complex nature of pain. From a genetics perspective, the discovery of genes relevant for the processing of pain in humans has been slow and genome-wide association studies have not been successful in yielding significantly associated variants. Targeted approaches examining specific candidate genes may be more promising. We present a novel integrative approach that combines publicly available molecular data and automatically extracted knowledge regarding pain contained in the literature to assist the discovery of novel pain genes. We prospectively validated this approach by demonstrating a significant association between several newly identified pain gene candidates and sensitivity to cold pressor pain.
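The candidate-selection step of this paradigm amounts to correlating each gene's expression across diseases with the disease-specific pain index (DSPI) and retaining genes whose correlation remains significant after multiple-testing correction. The sketch below illustrates that step using a Spearman correlation and Benjamini-Hochberg correction as stand-ins; the study's exact statistical choices may differ, and all names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def candidate_pain_genes(expression: np.ndarray, dspi: np.ndarray,
                         gene_names: list[str], fdr: float = 0.05) -> list[str]:
    """Correlate each gene's expression across diseases with the DSPI and keep
    genes whose correlation survives FDR correction.
    expression: genes x diseases matrix; dspi: one value per disease."""
    pvals = np.array([spearmanr(expression[i], dspi)[1]
                      for i in range(expression.shape[0])])
    reject, _, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
    return [g for g, keep in zip(gene_names, reject) if keep]
```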
Gene networks are commonly interpreted as encoding functional information in their connections. An extensively validated principle called guilt by association states that genes which are associated or interacting are more likely to share function. Guilt by association provides the central top-down principle for analyzing gene networks in functional terms or assessing their quality in encoding functional information. In this work, we show that functional information within gene networks is typically concentrated in only a very few interactions whose properties cannot be reliably related to the rest of the network. In effect, the apparent encoding of function within networks has been largely driven by outliers whose behaviour cannot even be generalized to individual genes, let alone to the network at large. While experimentalist-driven analysis of interactions may use prior expert knowledge to focus on the small fraction of critically important data, large-scale computational analyses have typically assumed that high-performance cross-validation in a network is due to a generalizable encoding of function. Because we find that gene function is not systemically encoded in networks, but dependent on specific and critical interactions, we conclude it is necessary to focus on the details of how networks encode function and what information computational analyses use to extract functional meaning. We explore a number of consequences of this and find that network structure itself provides clues as to which connections are critical and that systemic properties, such as scale-free-like behaviour, do not map onto the functional connectivity within networks.
The analysis of gene function and gene networks is a major theme of post-genome biomedical research. Historically, many attempts to understand gene function leverage a biological principle known as “guilt by association” (GBA). GBA states that genes with related functions tend to share properties such as genetic or physical interactions. In the past ten years, GBA has been scaled up for application to large gene networks, becoming a favored way to grapple with the complex interdependencies of gene functions in the face of floods of genomics and proteomics data. However, there is a growing realization that scaled-up GBA is not a panacea. In this study, we report a precise identification of the limits of GBA and show that it cannot provide an understanding of gene networks that is simultaneously general and useful. Our findings indicate that the assumptions underlying the high-throughput use of gene networks to interpret function are fundamentally flawed, with wide-ranging implications for the interpretation of genome-wide data.
The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and the number of contigs focus only on size without proportionately emphasizing information about the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the unrealistic assumptions of the underlying read generator. Recently, the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationships among the different features and their relative importance remain unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis, we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describe the assemblers' performances. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performances of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluations based on simulated data, obtained with state-of-the-art simulators, lead to unrealistic results.
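The dimensionality analysis can be sketched briefly: standardise the assemblies-by-features matrix, count how many principal components are needed to explain most of the variance, and use ICA to obtain a smaller set of interpretable components. The code below illustrates that workflow with generic thresholds and is not the paper's exact analysis.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.preprocessing import StandardScaler

def feature_space_dimensionality(features: np.ndarray, variance_target: float = 0.95):
    """Estimate the effective dimensionality of an assembly-feature matrix
    (assemblies x features, e.g. mis-assembly counts, coverage anomalies, N50):
    the number of principal components explaining `variance_target` of the
    variance, plus an ICA decomposition of the same size."""
    X = StandardScaler().fit_transform(features)
    pca = PCA().fit(X)
    n_components = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_),
                                       variance_target) + 1)
    ica = FastICA(n_components=n_components, random_state=0).fit(X)
    return n_components, pca.explained_variance_ratio_, ica.components_
```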
With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: reconstructing the abundances of the isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present the theoretical results on which IQSeq is based, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one can estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-Seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a substantial speedup compared to brute-force resampling. IQSeq also calculates an information-theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.
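For readers unfamiliar with the underlying estimation problem, a generic expectation-maximisation estimator for isoform abundances is sketched below. This is a common baseline formulation, not the IQSeq algorithm itself, and it ignores fragment-length effects and isoform-length normalisation.

```python
import numpy as np

def em_isoform_abundance(compat: np.ndarray, n_iter: int = 200) -> np.ndarray:
    """Simple EM sketch for isoform quantification: compat[r, i] is the likelihood
    of read r under isoform i (zero if the read cannot originate from isoform i;
    reads incompatible with every isoform should be removed beforehand).
    Returns relative isoform abundances."""
    n_reads, n_isoforms = compat.shape
    theta = np.full(n_isoforms, 1.0 / n_isoforms)
    for _ in range(n_iter):
        # E-step: posterior assignment of each read to the isoforms
        weighted = compat * theta
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: abundances proportional to expected read counts
        theta = weighted.sum(axis=0) / n_reads
    return theta
```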
Web-based, free-text documents on science and technology have been growing rapidly on the web. However, most of these documents are not immediately processable by computers, which slows down the acquisition of useful information. Computational ontologies might represent a possible solution by enabling semantically machine-readable datasets. However, the process of ontology creation, instantiation, and maintenance is still based on manual methodologies and is thus time- and cost-intensive.
We focused on a large corpus containing information on researchers, research fields, and institutions. We based our strategy on traditional entity recognition, social computing, and correlation. We devised a semi-automatic approach for the recognition, correlation, and extraction of named entities and relations from textual documents, which are then used to create, instantiate, and maintain an ontology.
We present a prototype demonstrating the applicability of the proposed strategy, along with a case study describing how direct and indirect relations can be extracted from academic and professional activities registered in a database of curriculum vitae in free-text format. We present evidence that this system can identify entities to assist in the process of knowledge extraction and representation to support ontology maintenance. We also demonstrate the extraction of relationships among ontology classes and their instances.
We have demonstrated that our system can be used to convert research information in free-text format into a database with a semantic structure. Future studies should test this system using the growing amount of free-text information available at the institutional and national levels.