Related Articles

1.  Graphite Web: web tool for gene set analysis exploiting pathway topology 
Nucleic Acids Research  2013;41(Web Server issue):W89-W97.
Graphite web is a novel web tool for pathway analysis and network visualization of gene expression data from both microarray and RNA-seq experiments. Several pathway analyses have been proposed, in either the univariate or the global and multivariate context, to tackle the complexity and the interpretation of expression results. These methods can be further divided into ‘topological’ and ‘non-topological’ methods according to their ability to gain power from pathway topology. Biological pathways are, in fact, not only gene lists but can be represented as a network in which genes are nodes and their connections are edges. To this day, the most widely used approaches are non-topological and univariate, although they miss the relationships among genes. Topological and multivariate approaches, by contrast, are more powerful but difficult for researchers without bioinformatics skills to use. Here we present Graphite web, the first public web server for pathway analysis on gene expression data that combines topological and multivariate pathway analyses with an efficient system of interactive network visualizations for easy interpretation of results. Specifically, Graphite web implements five different gene set analyses on three model organisms and two pathway databases. Graphite web is freely available at http://graphiteweb.bio.unipd.it/.
doi:10.1093/nar/gkt386
PMCID: PMC3977659  PMID: 23666626
2.  KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor 
Bioinformatics  2009;25(11):1470-1471.
Motivation: KEGG PATHWAY is a service of Kyoto Encyclopedia of Genes and Genomes (KEGG), constructing manually curated pathway maps that represent current knowledge on biological networks in graph models. While valuable graph tools have been implemented in R/Bioconductor, to our knowledge there is currently no software package to parse and analyze KEGG pathways with graph theory.
Results: We introduce the software package KEGGgraph in R and Bioconductor, an interface between KEGG pathways and graph models as well as a collection of tools for these graphs. Superior to existing approaches, KEGGgraph captures the pathway topology and allows further analysis or dissection of pathway graphs. We demonstrate the use of the package by the case study of analyzing human pancreatic cancer pathway.
Availability: KEGGgraph is freely available at the Bioconductor web site (http://www.bioconductor.org). KGML files can be downloaded from the KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/xml).
Contact: j.zhang@dkfz-heidelberg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp167
PMCID: PMC2682514  PMID: 19307239
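To make the idea of turning a KGML pathway file into a graph concrete, here is a minimal sketch in Python rather than R. It is not the KEGGgraph API. It assumes only the basic KGML layout of `entry` elements (nodes) and `relation` elements with an optional `subtype` child (edges); the miniature document and its gene identifiers are hand-written placeholders, not a real KEGG map.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature KGML-like document (placeholder identifiers,
# not a real KEGG pathway).
KGML = """<pathway name="path:demo" org="hsa" number="00000">
  <entry id="1" name="hsa:1956" type="gene"/>
  <entry id="2" name="hsa:3845" type="gene"/>
  <entry id="3" name="hsa:5594" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>"""


def kgml_to_graph(xml_text):
    """Return (nodes, edges): nodes maps entry id -> name, and edges is a
    list of (source name, target name, subtype label) tuples."""
    root = ET.fromstring(xml_text)
    nodes = {e.get("id"): e.get("name") for e in root.findall("entry")}
    edges = []
    for rel in root.findall("relation"):
        subtype = rel.find("subtype")
        label = subtype.get("name") if subtype is not None else "unknown"
        edges.append((nodes[rel.get("entry1")], nodes[rel.get("entry2")], label))
    return nodes, edges


def out_degree(edges):
    """Count outgoing edges per source node, a simple topological summary."""
    counts = {}
    for source, _target, _label in edges:
        counts[source] = counts.get(source, 0) + 1
    return counts


nodes, edges = kgml_to_graph(KGML)
for source, target, label in edges:
    print(f"{source} -[{label}]-> {target}")
```

Once the pathway is held as a node map and an edge list, any graph-theoretic analysis (degree, reachability, subgraph extraction) can operate on it, which is the point the abstract makes about capturing topology rather than a flat gene list.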
3.  Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks 
BMC Systems Biology  2012;6:29.
Background
Understanding the information-processing capabilities of signal transduction networks, how those networks are disrupted in disease, and rationally designing therapies to manipulate diseased states require systematic and accurate reconstruction of network topology. Data on networks central to human physiology, such as the inflammatory signalling networks analyzed here, are found in a multiplicity of on-line resources of pathway and interactome databases (Cancer CellMap, GeneGo, KEGG, NCI-Pathway Interactome Database (NCI-PID), PANTHER, Reactome, I2D, and STRING). We sought to determine whether these databases contain overlapping information and whether they can be used to construct high reliability prior knowledge networks for subsequent modeling of experimental data.
Results
We have assembled an ensemble network from multiple on-line sources representing a significant portion of all machine-readable and reconcilable human knowledge on proteins and protein interactions involved in inflammation. This ensemble network has many features expected of complex signalling networks assembled from high-throughput data: a power law distribution of both node degree and edge annotations, and topological features of a “bow tie” architecture in which diverse pathways converge on a highly conserved set of enzymatic cascades focused around PI3K/AKT, MAPK/ERK, JAK/STAT, NFκB, and apoptotic signaling. Individual pathways exhibit “fuzzy” modularity that is statistically significant but still involving a majority of “cross-talk” interactions. However, we find that the most widely used pathway databases are highly inconsistent with respect to the actual constituents and interactions in this network. Using a set of growth factor signalling networks as examples (epidermal growth factor, transforming growth factor-beta, tumor necrosis factor, and wingless), we find a multiplicity of network topologies in which receptors couple to downstream components through myriad alternate paths. Many of these paths are inconsistent with well-established mechanistic features of signalling networks, such as a requirement for a transmembrane receptor in sensing extracellular ligands.
Conclusions
Wide inconsistencies among interaction databases, pathway annotations, and the numbers and identities of nodes associated with a given pathway pose a major challenge for deriving causal and mechanistic insight from network graphs. We speculate that these inconsistencies are at least partially attributable to the cell- and context-specificity of cellular signal transduction, which is largely unaccounted for in available databases, but the absence of standardized vocabularies is an additional confounding factor. As a result of discrepant annotations, it is very difficult to identify biologically meaningful pathways from interactome networks a priori. However, by incorporating prior knowledge, it is possible to successively build out network complexity with high confidence from a simple linear signal transduction scaffold. Such reduced complexity networks appear suitable for use in mechanistic models while being richer and better justified than the simple linear pathways usually depicted in diagrams of signal transduction.
doi:10.1186/1752-0509-6-29
PMCID: PMC3436686  PMID: 22548703
4.  Detecting and Removing Inconsistencies between Experimental Data and Signaling Network Topologies Using Integer Linear Programming on Interaction Graphs 
PLoS Computational Biology  2013;9(9):e1003204.
Cross-referencing experimental data with our current knowledge of signaling network topologies is one central goal of mathematical modeling of cellular signal transduction networks. We present a new methodology for data-driven interrogation and training of signaling networks. While most published methods for signaling network inference operate on Bayesian, Boolean, or ODE models, our approach uses integer linear programming (ILP) on interaction graphs to encode constraints on the qualitative behavior of the nodes. These constraints are posed by the network topology and their formulation as ILP allows us to predict the possible qualitative changes (up, down, no effect) of the activation levels of the nodes for a given stimulus. We provide four basic operations to detect and remove inconsistencies between measurements and predicted behavior: (i) find a topology-consistent explanation for responses of signaling nodes measured in a stimulus-response experiment (if none exists, find the closest explanation); (ii) determine a minimal set of nodes that need to be corrected to make an inconsistent scenario consistent; (iii) determine the optimal subgraph of the given network topology which can best reflect measurements from a set of experimental scenarios; (iv) find possibly missing edges that would improve the consistency of the graph with respect to a set of experimental scenarios the most. We demonstrate the applicability of the proposed approach by interrogating a manually curated interaction graph model of EGFR/ErbB signaling against a library of high-throughput phosphoproteomic data measured in primary hepatocytes. Our methods detect interactions that are likely to be inactive in hepatocytes and provide suggestions for new interactions that, if included, would significantly improve the goodness of fit. Our framework is highly flexible and the underlying model requires only easily accessible biological knowledge. 
All related algorithms are implemented in the freely available toolbox SigNetTrainer, making the approach appealing for a variety of applications.
Author Summary
Cellular signal transduction is orchestrated by communication networks of signaling proteins commonly depicted on signaling pathway maps. However, each cell type may have distinct variants of signaling pathways, and wiring diagrams are often altered in disease states. The identification of truly active signaling topologies based on experimental data is therefore one key challenge in systems biology of cellular signaling. We present a new framework for training signaling networks based on interaction graphs (IG). In contrast to complex modeling formalisms, IG capture merely the known positive and negative edges between the components. This basic information, however, already sets hard constraints on the possible qualitative behaviors of the nodes when perturbing the network. Our approach uses Integer Linear Programming to encode these constraints and to predict the possible changes (down, neutral, up) of the activation levels of the involved players for a given experiment. Based on this formulation we developed several algorithms for detecting and removing inconsistencies between measurements and network topology. Demonstrated by EGFR/ErbB signaling in hepatocytes, our approach delivers direct conclusions on edges that are likely inactive or missing relative to canonical pathway maps. Such information drives the further elucidation of signaling network topologies under normal and pathological phenotypes.
doi:10.1371/journal.pcbi.1003204
PMCID: PMC3764019  PMID: 24039561
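The notion of a topology-consistent explanation, operation (i) in the abstract above, can be sketched on a toy signed interaction graph. The sketch below is not SigNetTrainer and does not use integer linear programming: it enumerates every labelling by brute force, which is only feasible for tiny graphs. The consistency rule is a simplified assumption made for illustration (a node may go up or down only if at least one incoming edge pushes it that way, and may stay neutral only if it receives no net push or opposing pushes); the graph, node names and measurements are hypothetical.

```python
from itertools import product

# Toy signed interaction graph (hypothetical): (source, target) -> sign,
# where +1 is activation and -1 is inhibition.
EDGES = {("L", "R"): +1, ("R", "K"): +1, ("P", "K"): -1}
NODES = ["L", "R", "P", "K"]
# Experimental scenario: ligand L is raised, phosphatase P is left untouched.
INPUTS = {"L": +1, "P": 0}


def is_consistent(state):
    """Check a labelling (node -> -1, 0 or +1) against the simplified rule."""
    for node in NODES:
        if node in INPUTS:
            if state[node] != INPUTS[node]:
                return False
            continue
        # Directions in which the predecessors push this node.
        pushes = {sign * state[src] for (src, dst), sign in EDGES.items() if dst == node}
        pushes.discard(0)
        value = state[node]
        if value != 0 and value not in pushes:
            return False  # change without any supporting edge
        if value == 0 and len(pushes) == 1:
            return False  # unopposed push must move the node
    return True


def consistent_states():
    """Enumerate all labellings that satisfy the rule (brute force)."""
    states = []
    for values in product((-1, 0, +1), repeat=len(NODES)):
        state = dict(zip(NODES, values))
        if is_consistent(state):
            states.append(state)
    return states


def closest_explanation(measured):
    """Return (mismatching measured nodes, consistent state) with fewest mismatches."""
    best = None
    for state in consistent_states():
        mismatches = [n for n, v in measured.items() if state[n] != v]
        if best is None or len(mismatches) < len(best[0]):
            best = (mismatches, state)
    return best


mismatches, state = closest_explanation({"R": +1, "K": 0})
print(mismatches, state)
```

In this toy scenario the only consistent labelling has both R and K going up, so a measurement of "no change" in K is flagged as the single inconsistency. An ILP formulation expresses the same constraints as linear inequalities over integer variables so that a solver can search networks far too large to enumerate.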
5.  NuChart: An R Package to Study Gene Spatial Neighbourhoods with Multi-Omics Annotations 
PLoS ONE  2013;8(9):e75146.
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus opening the possibility of identifying novel biomarkers. NuChart is compliant with the Bioconductor standard and is freely available at ftp://fileserver.itb.cnr.it/nuchart.
doi:10.1371/journal.pone.0075146
PMCID: PMC3777921  PMID: 24069388
6.  Automated identification of pathways from quantitative genetic interaction data 
We present a novel Bayesian learning method that reconstructs large detailed gene networks from quantitative genetic interaction (GI) data. The method uses global reasoning to handle missing and ambiguous measurements, and provides confidence estimates for each prediction. Applied to a recent data set over genes relevant to protein folding, the learned networks reflect known biological pathways, including details such as pathway ordering and directionality of relationships. The reconstructed networks also suggest novel relationships, including the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated.
Recent developments have enabled large-scale quantitative measurement of genetic interactions (GIs) that report on the extent to which the activity of one gene is dependent on a second. It has long been recognized (Avery and Wasserman, 1992; Hartman et al, 2001; Segre et al, 2004; Tong et al, 2004; Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Costanzo et al, 2010) that functional dependencies revealed by GI data can provide rich information regarding underlying biological pathways. Further, the precise phenotypic measurements provided by quantitative GI data can provide evidence for even more detailed aspects of pathway structure, such as differentiating between full and partial dependence between two genes (Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Jonikas et al, 2009) (Figure 1A). As GI data sets become available for a range of quantitative phenotypes and organisms, such patterns will allow researchers to elucidate pathways important to a diverse set of biological processes.
We present a new method that exploits the high-quality, quantitative nature of recent GI assays to automatically reconstruct detailed multi-gene pathway structures, including the organization of a large set of genes into coherent pathways, the connectivity and ordering within each pathway, and the directionality of each relationship. We introduce activity pathway networks (APNs), which represent functional dependencies among a set of genes in the form of a network. We present an automatic method to efficiently reconstruct APNs over large sets of genes based on quantitative GI measurements. This method handles uncertainty in the data arising from noise, missing measurements, and data points with ambiguous interpretations, by performing global reasoning that combines evidence from multiple data points. In addition, because some structure choices remain uncertain even when jointly considering all measurements, our method maintains multiple likely networks, and allows computation of confidence estimates over each structure choice.
We applied our APN reconstruction method to the recent high-quality GI data set of Jonikas et al (2009), which examined the functional interaction between genes that contribute to protein folding in the ER. Specifically, Jonikas et al used the cell's endogenous sensor (the unfolded protein response), to first identify several hundred yeast genes with functions in endoplasmic reticulum folding and then systematically characterized their functional interdependencies by measuring unfolded protein response levels in double mutants. Our analysis produced an ensemble of 500 likelihood-weighted APNs over 178 genes (Figure 2).
We performed an aggregate evaluation of our results by comparing to known biological relationships between gene pairs, including participation in pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG), correlation of chemical genomic profiles in a recent high-throughput assay (Hillenmeyer et al, 2008) and similarity of Gene Ontology (GO) annotations. In each evaluation performed, our reconstructed APNs were significantly more consistent with the known relationships than either the raw GI values or the Pearson correlation between profiles of GI values.
Importantly, our approach provides not only an improved means for defining pairs or groups of related genes, but also enables the identification of detailed multi-gene network structures. In many cases, our method successfully reconstructed known cellular pathways, including the ER-associated degradation (ERAD) pathway, and the biosynthesis of N-linked glycans, ranking them among the highest confidence structures. In-depth examination of the learned network structures indicates agreement with many known details of these pathways. In addition, quantitative analysis indicates that our learned APNs are indicative of ordering within KEGG-annotated biological pathways.
Our results also suggest several novel relationships, including placement of uncharacterized genes into pathways, and novel relationships between characterized genes. These include the dependence of the J domain chaperone JEM1 on the PDI homolog MPD1, dependence of the Ubiquitin-recycling enzyme DOA4 on N-linked glycosylation, and the dependence of the E3 Ubiquitin ligase DOA10 on the signal peptidase complex subunit SPC2. Our APNs also place the poorly characterized TPR-containing protein SGT2 upstream of the tail-anchored protein biogenesis machinery components GET3, GET4, and MDY2 (also known as GET5), suggesting that SGT2 has a function in the insertion of tail-anchored proteins into membranes. Consistent with this prediction, our experimental analysis shows that in sgt2Δ cells the tail-anchored protein GFP-Sed5 is mislocalized, shifting from punctate Golgi structures to a more diffuse pattern, as seen for other genes involved in this pathway.
Our results show that multi-gene, detailed pathway networks can be reconstructed from quantitative GI data, providing a concrete computational manifestation to intuitions that have traditionally accompanied the manual interpretation of such data. Ongoing technological developments in both genetics and imaging are enabling the measurement of GI data at a genome-wide scale, using high-accuracy quantitative phenotypes that relate to a range of particular biological functions. Methods based on RNAi will soon allow collection of similar data for human cell lines and other mammalian systems (Moffat et al, 2006). Thus, computational methods for analyzing GI data could have an important function in mapping pathways involved in complex biological systems including human cells.
High-throughput quantitative genetic interaction (GI) measurements provide detailed information regarding the structure of the underlying biological pathways by reporting on functional dependencies between genes. However, the analytical tools for fully exploiting such information lag behind the ability to collect these data. We present a novel Bayesian learning method that uses quantitative phenotypes of double knockout organisms to automatically reconstruct detailed pathway structures. We applied our method to a recent data set that measures GIs for endoplasmic reticulum (ER) genes, using the unfolded protein response as a quantitative phenotype. The results provided reconstructions of known functional pathways including N-linked glycosylation and ER-associated protein degradation. It also contained novel relationships, such as the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated. Our approach should be readily applicable to the next generation of quantitative GI data sets, as assays become available for additional phenotypes and eventually higher-level organisms.
doi:10.1038/msb.2010.27
PMCID: PMC2913392  PMID: 20531408
computational biology; genetic interaction; pathway reconstruction; probabilistic methods
7.  Emergence of Switch-Like Behavior in a Large Family of Simple Biochemical Networks 
PLoS Computational Biology  2011;7(5):e1002039.
Bistability plays a central role in the gene regulatory networks (GRNs) controlling many essential biological functions, including cellular differentiation and cell cycle control. However, establishing the network topologies that can exhibit bistability remains a challenge, in part due to the exceedingly large variety of GRNs that exist for even a small number of components. We begin to address this problem by employing chemical reaction network theory in a comprehensive in silico survey to determine the capacity for bistability of more than 40,000 simple networks that can be formed by two transcription factor-coding genes and their associated proteins (assuming only the most elementary biochemical processes). We find that there exist reaction rate constants leading to bistability in ∼90% of these GRN models, including several circuits that do not contain any of the TF cooperativity commonly associated with bistable systems, and the majority of which could only be identified as bistable through an original subnetwork-based analysis. A topological sorting of the two-gene family of networks based on the presence or absence of biochemical reactions reveals eleven minimal bistable networks (i.e., bistable networks that do not contain within them a smaller bistable subnetwork). The large number of previously unknown bistable network topologies suggests that the capacity for switch-like behavior in GRNs arises with relative ease and is not easily lost through network evolution. To highlight the relevance of the systematic application of CRNT to bistable network identification in real biological systems, we integrated publicly available protein-protein interaction, protein-DNA interaction, and gene expression data from Saccharomyces cerevisiae, and identified several GRNs predicted to behave in a bistable fashion.
Author Summary
Switch-like behavior is found across a wide range of biological systems, and as a result there is significant interest in identifying the various ways in which biochemical reactions can be combined to yield a switch-like response. In this work we use a set of mathematical tools from chemical reaction network theory that provide information about the steady-states of a reaction network irrespective of the values of network rate constants, to conduct a large computational study of a family of model networks consisting of only two protein-coding genes. We find that a large majority of these networks (∼90%) have (for some set of parameters) the mathematical property known as bistability and can behave in a switch-like manner. Interestingly, the capacity for switch-like behavior is often maintained as networks increase in size through the introduction of new reactions. We then demonstrate using published yeast data how theoretical parameter-free surveys such as this one can be used to discover possible switch-like circuits in real biological systems. Our results highlight the potential usefulness of parameter-free modeling for the characterization of complex networks and to the study of network evolution, and are suggestive of a role for it in the development of novel synthetic biological switches.
doi:10.1371/journal.pcbi.1002039
PMCID: PMC3093349  PMID: 21589886
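As an illustration of what bistability means in practice, the sketch below integrates a textbook two-gene mutual-repression model and shows that two different starting points settle into two different steady states. This is not the chemical reaction network theory analysis described above, which is parameter-free and covers circuits without cooperativity. The model here is an assumption chosen for simplicity: it uses cooperative (Hill coefficient 2) repression, an arbitrary production rate `alpha = 4`, and forward-Euler integration.

```python
def simulate(x, y, alpha=4.0, hill=2, dt=0.01, steps=5000):
    """Forward-Euler integration of a symmetric mutual-repression model:
        dx/dt = alpha / (1 + y**hill) - x
        dy/dt = alpha / (1 + x**hill) - y
    Returns the state (x, y) after steps * dt time units."""
    for _ in range(steps):
        dx = alpha / (1.0 + y ** hill) - x
        dy = alpha / (1.0 + x ** hill) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Same parameters, two initial conditions on either side of the diagonal.
x_high = simulate(3.0, 0.1)
y_high = simulate(0.1, 3.0)
print("start with x dominant:", x_high)
print("start with y dominant:", y_high)
```

For these parameter values the two asymmetric steady states satisfy x·y = 1 and x + y = 4, that is, the pair 2 + √3 and 2 − √3. Which one the system reaches depends only on the initial condition, which is the switch-like behaviour the abstract refers to.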
8.  Chemical combination effects predict connectivity in biological systems 
Chemical synergies can be novel probes of biological systems. Simulated response shapes depend on target connectivity in a pathway. Experiments with yeast and cancer cells confirm simulated effects. Profiles across many combinations yield target location information.
Living organisms are built of interacting components, whose function and dysfunction can be described through dynamic network models (Davidson et al, 2002). Systems Biology involves the iterative construction of such models (Ideker et al, 2001), and may eventually improve the understanding of diseases using in silico simulations. Such simulations may eventually permit drugs to be prioritized for clinical trials, reducing potential risks and increasing the likelihood of successful outcomes. Given the complexity of biological systems, constructing realistic models will require large and diverse sets of connectivity data.
Chemical combinations provide a new window into biological connectivity. Information gleaned from targeted combinations, such as paired mutations (Tong et al, 2004), has proven to be especially useful for revealing functional interactions between components. We have been screening chemical combinations for therapeutic synergies (Borisy et al, 2003; Zimmermann et al, 2007), collecting full-dose matrices where combinations are tested in all possible pairings of serially diluted single agent doses (Figure 1). Such screens yield a variety of response surfaces with distinct shapes for combinations that work through different known mechanisms, suggesting that combination effects may contain information on the nature of functional connections between drug targets.
Simulations of biological pathways predict synergistic responses to inhibitors that depend on target connectivity. We explored theoretical predictions by simulating a metabolic pathway with pairs of inhibitors aimed at different targets with varying doses. We found that the shape of each combination response depended on how the inhibitor pair's targets were connected in the pathway (Figure 2). The predicted response shapes were robust to plausible variations in the simulated pathway that did not affect the network topology (e.g., kinetic assumptions, parameter values, and nonlinear response functions), but were very sensitive to topological alterations in the modelled network (e.g., feedback regulation or changing the type of junction at a branch point). These findings suggest that connectivity of the inhibitor targets has a major influence on combination response morphology.
The predicted shapes were experimentally confirmed in yeast combination experiments. The proliferation experiment used drugs focused on the sterol biosynthesis pathway, which is mostly linear between the targets covered in this study, and is known to be regulated by negative feedback (Gardner et al, 2001). The combinations between sterol inhibitors confirmed expectations from our simulations, showing dose-additive responses for pairs targeting the same enzyme and strong synergies across enzymes of the shape predicted in our simulations for linear pathways under negative feedback. Combinations across pathways showed much more variable responses with a trend towards less synergy on average.
Further experimental support was obtained from human cells. A combination screen of 90 annotated drugs in a human tumour cell line (HCT116) proliferation assay produced strong synergies for combinations within pathways and more variable effects between targeted functions. Synergy profiles (sets of all synergy scores involving each drug) also showed a greater degree of similarity for pairs of drugs with related targets. Finally, the most extreme outliers were dominated by inhibitors of kinases that are especially critical for HCT116 proliferation (Awwad et al, 2003), with effects that are consistent across mechanistic replicates, showing that chemical combinations can highlight biologically relevant cellular processes.
This study demonstrates the potential of chemical combinations for exploring functional connectivity in biological systems. This information complements genetic studies by providing more details through variable dosing, by directly targeting single domains of multi-domain proteins, and by probing cell types that are not amenable to mutagenesis. Responses from large chemical combination screens can be used to identify molecular targets through chemical–genetic profiling (Macdonald et al, 2006), or to directly constrain network models by means of a prediction-validation procedure (Ideker et al, 2001). This initial exploration can be extended to cover a wider range of response shapes and network topologies, as well as to combinations of three or more chemical agents. Moreover, this approach may even be applicable to non-biological systems where responses to targeted perturbations can be measured.
Efforts to construct therapeutically useful models of biological systems require large and diverse sets of data on functional connections between their components. Here we show that cellular responses to combinations of chemicals reveal how their biological targets are connected. Simulations of pathways with pairs of inhibitors at varying doses predict distinct response surface shapes that are reproduced in a yeast experiment, with further support from a larger screen using human tumour cells. The response morphology yields detailed connectivity constraints between nearby targets, and synergy profiles across many combinations show relatedness between targets in the whole network. Constraints from chemical combinations complement genetic studies, because they probe different cellular components and can be applied to disease models that are not amenable to mutagenesis. Chemical probes also offer increased flexibility, as they can be continuously dosed, temporally controlled, and readily combined. After extending this initial study to cover a wider range of combination effects and pathway topologies, chemical combinations may be used to refine network models or to identify novel targets. This response surface methodology may even apply to non-biological systems where responses to targeted perturbations can be measured.
doi:10.1038/msb4100116
PMCID: PMC1828746  PMID: 17332758
chemical genetics; combinations and synergy; metabolic and regulatory networks; simulation and data analysis
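The full-dose matrix design described above (all pairings of serially diluted single-agent doses) can be sketched as follows. The abstract does not state which reference model was used to score synergy, so the sketch assumes Bliss independence, one common choice; Loewe dose additivity is another and may be closer to the "dose-additive" language above. The single-agent Hill curves, IC50 values and dose ranges are hypothetical.

```python
def hill_inhibition(dose, ic50, slope=1.0):
    """Fractional inhibition in [0, 1) for a single agent (Hill curve)."""
    if dose <= 0:
        return 0.0
    return dose ** slope / (dose ** slope + ic50 ** slope)


def serial_dilution(top, factor, n):
    """Zero dose followed by n serially diluted doses, lowest to highest."""
    doses = [top / factor ** k for k in range(n)]
    return [0.0] + doses[::-1]


def bliss_expected(effect_a, effect_b):
    """Expected combined inhibition if the two agents act independently."""
    return effect_a + effect_b - effect_a * effect_b


def bliss_matrix(doses_a, doses_b, ic50_a, ic50_b):
    """Expected inhibition for every pairing of doses (rows: A, columns: B)."""
    return [
        [bliss_expected(hill_inhibition(a, ic50_a), hill_inhibition(b, ic50_b))
         for b in doses_b]
        for a in doses_a
    ]


def excess_over_bliss(observed, expected):
    """Element-wise observed minus expected; positive values suggest synergy."""
    return [[o - e for o, e in zip(row_o, row_e)]
            for row_o, row_e in zip(observed, expected)]


doses = serial_dilution(top=8.0, factor=2.0, n=3)
expected = bliss_matrix(doses, doses, ic50_a=2.0, ic50_b=2.0)
for row in expected:
    print(["%.2f" % value for value in row])
```

Subtracting such a reference surface from a measured dose matrix leaves the combination effect, and it is the shape of that residual surface that the study relates to how the two targets are connected.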
9.  Structural Measures for Network Biology Using QuACN 
BMC Bioinformatics  2011;12:492.
Background
Structural measures for networks have been extensively developed, but many of them have not yet demonstrated their sustainability. That is, it often remains unclear whether a particular measure is useful and feasible for solving a particular problem in network biology. One example is the classification of complex biological networks, where structural measures are chosen so as to minimize the classification error. Hence, there is a strong need for freely available software packages that calculate structural graph measures and demonstrate their appropriate usage in network biology.
Results
Here, we discuss topological network descriptors that are implemented in the R-package QuACN and demonstrate their behavior and characteristics by applying them to a set of example graphs. Moreover, we show a representative application to illustrate their capabilities for classifying biological networks. In particular, we infer gene regulatory networks from microarray data and classify them by methods provided by QuACN. Note that QuACN is the first freely available software written in R containing a large number of structural graph measures.
Conclusion
The R package QuACN is under ongoing development and we add promising groups of topological network descriptors continuously. The package can be used to answer intriguing research questions in network biology, e.g., classifying biological data or identifying meaningful biological features, by analyzing the topology of biological networks.
doi:10.1186/1471-2105-12-492
PMCID: PMC3293850  PMID: 22195644
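To show what a structural (topological) network descriptor looks like, the sketch below computes the Wiener index, a classical distance-based measure defined as the sum of shortest-path lengths over all unordered node pairs. It is written in Python and is not the QuACN interface; the abstract does not list which descriptors the package implements, so the choice of this measure is an assumption made for illustration. The code assumes a connected, undirected, unweighted graph given as an adjacency dictionary.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Shortest-path length (in edges) from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered node pairs."""
    total = sum(sum(bfs_distances(adjacency, node).values()) for node in adjacency)
    return total // 2  # every pair was counted once from each endpoint


# Two 4-node example graphs with the same number of nodes and edges.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
print(wiener_index(path), wiener_index(star))
```

The two example graphs have identical size but different index values, which is what lets a vector of such descriptors serve as features when classifying networks.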
10.  A Graphical Modelling Approach to the Dissection of Highly Correlated Transcription Factor Binding Site Profiles 
PLoS Computational Biology  2012;8(11):e1002725.
Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor (http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. 
Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.
Author Summary
Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. Recent technological advances make it possible to map TF binding patterns across the whole genome. Multiple single-gene studies showed that combinatorial binding of multiple transcription factors determines the gene transcriptional output. A common naive assumption is that correlated binding profiles may indicate combinatorial binding. However, it has been found that many TFs bind to distinct hotspots whose role is currently unclear. It is thus of great interest to find transcription factor combinations whose correlated binding is causally most immediate to gene expression. Building upon theories of statistical dependence and causality, we develop novel graphical model-based algorithms that handle highly correlated transcription factor binding profiles more efficiently and reliably than existing algorithms do. These algorithms can also be applied to other biological areas involving highly correlated variables, such as the analysis of high-throughput gene knock-down experiments.
doi:10.1371/journal.pcbi.1002725
PMCID: PMC3493460  PMID: 23144600
11.  minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information 
BMC Bioinformatics  2008;9:461.
Results
This paper presents the R/Bioconductor package minet (version 1.1.6) which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g. transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves in order to compare the inferred network with a reference one.
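As a rough illustration of the simplest combination named above (the empirical entropy estimator with a relevance network), the following Python sketch computes plug-in mutual information between discretized expression profiles and keeps the gene pairs whose score exceeds a cutoff. It is not the minet implementation: the function names, the toy profiles and the threshold are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations
from math import log


def empirical_mutual_information(x, y):
    """Plug-in (empirical) mutual information, in nats, of two discrete vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a, b) / (p(a) * p(b)), written with raw counts
        mi += p_ab * log(p_ab * n * n / (count_x[a] * count_y[b]))
    return mi


def relevance_network(profiles, threshold):
    """Return (gene_a, gene_b, mi) edges whose mutual information exceeds threshold.

    `profiles` maps a gene name to its discretized expression values
    (one value per sample); both the format and the cutoff are hypothetical.
    """
    edges = []
    for gene_a, gene_b in combinations(profiles, 2):
        mi = empirical_mutual_information(profiles[gene_a], profiles[gene_b])
        if mi > threshold:
            edges.append((gene_a, gene_b, mi))
    return edges


# Toy data: g1 and g2 are identical, g3 is independent of both.
profiles = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],
    "g3": [0, 1, 0, 1],
}
edges = relevance_network(profiles, threshold=0.1)
```

In this toy case only the g1/g2 pair survives the cutoff. ARACNE, CLR and MRNET apply further pruning or rescaling on top of the same pairwise scores, which this sketch does not attempt.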
Conclusion
The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
doi:10.1186/1471-2105-9-461
PMCID: PMC2630331  PMID: 18959772
12.  Signalling Network Construction for Modelling Plant Defence Response 
PLoS ONE  2012;7(12):e51822.
Plant defence signalling response against various pathogens, including viruses, is a complex phenomenon. In resistant interaction a plant cell perceives the pathogen signal, transduces it within the cell and performs a reprogramming of the cell metabolism leading to the pathogen replication arrest. This work focuses on signalling pathways crucial for the plant defence response, i.e., the salicylic acid, jasmonic acid and ethylene signal transduction pathways, in the Arabidopsis thaliana model plant. The initial signalling network topology was constructed manually by defining the representation formalism, encoding the information from public databases and literature, and composing a pathway diagram. The manually constructed network structure consists of 175 components and 387 reactions. In order to complement the network topology with possibly missing relations, a new approach to automated information extraction from biological literature was developed. This approach, named Bio3graph, allows for automated extraction of biological relations from the literature, resulting in a set of (component1, reaction, component2) triplets and composing a graph structure which can be visualised, compared to the manually constructed topology and examined by the experts. Using a plant defence response vocabulary of components and reaction types, Bio3graph was applied to a set of 9,586 relevant full text articles, resulting in 137 newly detected reactions between the components. Finally, the manually constructed topology and the new reactions were merged to form a network structure consisting of 175 components and 524 reactions. The resulting pathway diagram of plant defence signalling represents a valuable source for further computational modelling and interpretation of omics data. 
The developed Bio3graph approach, implemented as an executable language processing and graph visualisation workflow, is publicly available at http://ropot.ijs.si/bio3graph/ and can be utilised for modelling other biological systems, given that an adequate vocabulary is provided.
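The merge of manually curated and automatically extracted reactions can be pictured as a union of (component1, reaction, component2) triplets; the reported totals fit that reading, since 387 + 137 = 524. The Python sketch below shows one possible way to do such a merge. It is not the Bio3graph workflow itself, and the component names, reaction types and function names are placeholders invented for the example.

```python
def merge_reactions(manual, extracted):
    """Merge two triplet lists; return (merged, newly_detected).

    Triplets already present in the manual topology are not counted as new.
    """
    manual_unique = list(dict.fromkeys(manual))  # drop duplicates, keep order
    known = set(manual_unique)
    newly_detected = [t for t in dict.fromkeys(extracted) if t not in known]
    return manual_unique + newly_detected, newly_detected


def to_adjacency(triplets):
    """Build a directed graph: source component -> list of (reaction, target)."""
    graph = {}
    for source, reaction, target in triplets:
        graph.setdefault(source, []).append((reaction, target))
    return graph


# Placeholder triplets, not taken from the published network.
manual = [("compA", "activates", "compB"), ("compB", "inhibits", "compC")]
extracted = [("compB", "inhibits", "compC"), ("compC", "binds", "compD")]

merged, newly_detected = merge_reactions(manual, extracted)
graph = to_adjacency(merged)
```

Keeping the newly detected triplets as a separate list mirrors the review step described above, in which experts compare the extracted relations with the manual topology before accepting them.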
doi:10.1371/journal.pone.0051822
PMCID: PMC3525666  PMID: 23272172
13.  A comprehensive network and pathway analysis of candidate genes in major depressive disorder 
BMC Systems Biology  2011;5(Suppl 3):S12.
Background
Numerous genetic and genomic datasets related to complex diseases have been made available during the last decade. It is now a great challenge to assess such heterogeneous datasets to prioritize disease genes and perform follow up functional analysis and validation. Among complex disease studies, psychiatric disorders such as major depressive disorder (MDD) are especially in need of robust integrative analysis because these diseases are more complex than others, with weak genetic factors at various levels, including genetic markers, transcription (gene expression), epigenetics (methylation), protein, pathways and networks.
Results
In this study, we proposed a comprehensive analysis framework at the systems level and demonstrated it in MDD using a set of candidate genes that have recently been prioritized based on multiple lines of evidence including association, linkage, gene expression (both human and animal studies), regulatory pathway, and literature search. In the network analysis, we explored the topological characteristics of these genes in the context of the human interactome and compared them with two other complex diseases. The network topological features indicated that MDD is more similar to schizophrenia than to cancer. In the functional analysis, we performed the gene set enrichment analysis for both Gene Ontology categories and canonical pathways. Moreover, we proposed a unique pathway crosstalk approach to examine the dynamic interactions among biological pathways. Our pathway enrichment and crosstalk analyses revealed two unique pathway interaction modules that were significantly enriched with MDD genes. These two modules are neuro-transmission and immune system related, supporting the neuropathology hypothesis of MDD. Finally, we constructed an MDD-specific subnetwork, which recruited novel candidate genes with association signals from a major MDD GWAS dataset.
Conclusions
This study is the first systematic network and pathway analysis of candidate genes in MDD, providing abundant important information about gene interaction and regulation in a major psychiatric disease. The results suggest potential functional components underlying the molecular mechanisms of MDD and, thus, facilitate generation of novel hypotheses in this disease. The systems biology based strategy in this study can be applied to many other complex diseases.
doi:10.1186/1752-0509-5-S3-S12
PMCID: PMC3287567  PMID: 22784618
14.  puma 3.0: improved uncertainty propagation methods for gene and transcript expression analysis 
BMC Bioinformatics  2013;14:39.
Background
Microarrays have been a popular tool for gene expression profiling at genome-scale for over a decade due to the low cost, short turn-around time, excellent quantitative accuracy and ease of data generation. The Bioconductor package puma incorporates a suite of analysis methods for determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analysis. As isoform level expression profiling receives more and more interest within genomics in recent years, exon microarray technology offers an important tool to quantify expression level of the majority of exons and enables the possibility of measuring isoform level expression. However, puma does not include methods for the analysis of exon array data. Moreover, the current expression summarisation method for Affymetrix 3’ GeneChip data suffers from instability for low expression genes. For the downstream analysis, the method for differential expression detection is computationally intensive and the original expression clustering method does not consider the variance across the replicated technical and biological measurements. It is therefore necessary to develop improved uncertainty propagation methods for gene and transcript expression analysis.
Results
We extend the previously developed Bioconductor package puma with a new method especially designed for GeneChip Exon arrays and a set of improved downstream approaches. The improvements include: (i) a new gamma model for exon arrays which calculates isoform and gene expression measurements and a level of uncertainty associated with the estimates, using the multi-mappings between probes, isoforms and genes, (ii) a variant of the existing approach for the probe-level analysis of Affymetrix 3’ GeneChip data to produce more stable gene expression estimates, (iii) an improved method for detecting differential expression which is computationally more efficient than the existing approach in the package and (iv) an improved method for robust model-based clustering of gene expression, which takes technical and biological replicate information into consideration.
Conclusions
With the extensions and improvements, the puma package is now applicable to the analysis of both Affymetrix 3’ GeneChips and Exon arrays for gene and isoform expression estimation. It propagates the uncertainty of expression measurements into more efficient and comprehensive downstream analysis at both gene and isoform level. Downstream methods are also applicable to other expression quantification platforms, such as RNA-Seq, when uncertainty information is available from expression measurements. puma is available through Bioconductor and can be found at http://www.bioconductor.org.
doi:10.1186/1471-2105-14-39
PMCID: PMC3626802  PMID: 23379655
15.  Organization of Physical Interactomes as Uncovered by Network Schemas 
PLoS Computational Biology  2008;4(10):e1000203.
Large-scale protein-protein interaction networks provide new opportunities for understanding cellular organization and functioning. We introduce network schemas to elucidate shared mechanisms within interactomes. Network schemas specify descriptions of proteins and the topology of interactions among them. We develop algorithms for systematically uncovering recurring, over-represented schemas in physical interaction networks. We apply our methods to the S. cerevisiae interactome, focusing on schemas consisting of proteins described via sequence motifs and molecular function annotations and interacting with one another in one of four basic network topologies. We identify hundreds of recurring and over-represented network schemas of various complexity, and demonstrate via graph-theoretic representations how more complex schemas are organized in terms of their lower-order constituents. The uncovered schemas span a wide range of cellular activities, with many signaling and transport related higher-order schemas. We establish the functional importance of the schemas by showing that they correspond to functionally cohesive sets of proteins, are enriched in the frequency with which they have instances in the H. sapiens interactome, and are useful for predicting protein function. Our findings suggest that network schemas are a powerful paradigm for organizing, interrogating, and annotating cellular networks.
Author Summary
Large-scale networks of protein-protein interactions provide a view into the workings of the cell. However, these interaction maps do not come with a key for interpreting them, so it is necessary to develop methods that shed light on their functioning and organization. We propose the language of network schemas for describing recurring patterns of specific types of proteins and their interactions. That is, network schemas describe proteins and specify the topology of interactions among them. A single network schema can describe, for example, a common template that underlies several distinct cellular pathways, such as signaling pathways. We develop a computational methodology for identifying network schemas that are recurrent and over-represented in the network, even given the distributions of their constituent components. We apply this methodology to the physical interaction network in S. cerevisiae and begin to build a hierarchy of schemas starting with the four simplest topologies. We validate the biological relevance of the schemas that we find, discuss the insights our findings lend into the organization of interactomes, touch upon cross-genomic aspects of schema analysis, and show how to use schemas to annotate uncharacterized protein families.
doi:10.1371/journal.pcbi.1000203
PMCID: PMC2561054  PMID: 18949022
16.  HepatoNet1: a comprehensive metabolic reconstruction of the human hepatocyte for the analysis of liver physiology 
We present HepatoNet1, a manually curated large-scale metabolic network of the human hepatocyte that encompasses >2500 reactions in six intracellular and two extracellular compartments. Using constraint-based modeling techniques, the network has been validated to replicate numerous metabolic functions of hepatocytes corresponding to a reference set of diverse physiological liver functions. Taking the detoxification of ammonia and the formation of bile acids as examples, we show how these liver-specific metabolic objectives can be achieved by the variable interplay of various metabolic pathways under varying conditions of nutrients and oxygen availability.
The liver has a pivotal function in metabolic homeostasis of the human body. Hepatocytes are the principal site of the metabolic conversions that underlie diverse physiological functions of the liver. These functions include provision and homeostasis of carbohydrates, amino acids, lipids and lipoproteins in the systemic blood circulation, biotransformation, plasma protein synthesis and bile formation, to name a few. Accordingly, hepatocyte metabolism integrates a vast array of differentially regulated biochemical activities and is highly responsive to environmental perturbations such as changes in portal blood composition (Dardevet et al, 2006). The complexity of this metabolic network and the numerous physiological functions to be achieved within a highly variable physiological environment necessitate an integrated approach with the aim of understanding liver metabolism at a systems level. To this end, we present HepatoNet1, a stoichiometric network of human hepatocyte metabolism characterized by (i) comprehensive coverage of known biochemical activities of hepatocytes and (ii) due representation of the biochemical and physiological functions of hepatocytes as functional network states. The network comprises 777 metabolites in six intracellular (cytosol, endoplasmic reticulum and Golgi apparatus, lysosome, mitochondria, nucleus, and peroxisome) and two extracellular compartments (bile canaliculus and sinusoidal space) and 2539 reactions, including 1466 transport reactions. It is based on the manual evaluation of >1500 original scientific research publications to warrant a high-quality evidence-based model. The final network is the result of an iterative process of data compilation and rigorous computational testing of network functionality by means of constraint-based modeling techniques. We performed flux-balance analyses to validate whether for >300 different metabolic objectives a non-zero stationary flux distribution could be established in the network. 
Figure 1 shows one such functional flux mode associated with the synthesis of the bile acid glycochenodeoxycholate, one important hepatocyte-specific physiological liver function. Besides those pathways directly linked to the synthesis of the bile acid, the mevalonate pathway and the de novo synthesis of cholesterol, the flux mode comprises additional pathways such as gluconeogenesis, the pentose phosphate pathway or the ornithine cycle because the calculations were routinely performed on a minimal set of exchangeable metabolites, that is all reactants were forced to be balanced and all exportable intermediates had to be catabolized into non-degradable end products. This example shows how HepatoNet1 under the challenges of limited exchange across the network boundary can reveal numerous cross-links between metabolic pathways traditionally perceived as separate entities. For example, alanine is used as gluconeogenic substrate to form glucose-6-phosphate, which is used in the pentose phosphate pathway to generate NADPH. The glycine moiety for bile acid conjugation is derived from serine. Conversion of ammonia into non-toxic nitrogen compounds is one central homeostatic function of hepatocytes. Using the HepatoNet1 model, we investigated, as another example of a complex metabolic objective dependent on systemic physiological parameters, how the consumption of oxygen, glucose and palmitate is affected when an external nitrogen load is converted in varying proportions to the non-toxic nitrogen compounds: urea, glutamine and alanine. The results reveal strong dependencies between the available level of oxygen and the substrate demand of hepatocytes required for effective ammonia detoxification by the liver.
Oxygen demand is highest if nitrogen is exclusively transformed into urea. At lower fluxes into urea, an intriguing pattern for oxygen demand is predicted: oxygen demand attains a minimum if the nitrogen load is directed to urea, glutamine and alanine with relative fluxes of 0.17, 0.43 and 0.40, respectively (Figure 2A). Oxygen demand in this flux distribution is four times lower than for the maximum (100% urea) and still 77 and 33% lower than using alanine and glutamine as exclusive nitrogen compounds, respectively. This computationally predicted tendency is consistent with the notion that the zonation of ammonia detoxification, that is the preferential conversion of ammonia to urea in periportal hepatocytes and to glutamine in perivenous hepatocytes, is dictated by the availability of oxygen (Gebhardt, 1992; Jungermann and Kietzmann, 2000). The decreased oxygen demand in flux distributions using higher proportions of glutamine or alanine is accompanied by increased uptake of the substrates glucose and palmitate (Figure 2B). This is due to an increased demand of energy and carbon for the amidation and transamination of glutamate and pyruvate to discharge nitrogen in the form of glutamine and alanine, respectively. In terms of both scope and specificity, our model bridges the scale between models constructed specifically to examine distinct metabolic processes of the liver and modeling based on a global representation of human metabolism. The former include models for the interdependence of gluconeogenesis and fatty-acid catabolism (Chalhoub et al, 2007), impairment of glucose production in von Gierke's and Hers' diseases (Beard and Qian, 2005) and other processes (Calik and Akbay, 2000; Stucki and Urbanczik, 2005; Ohno et al, 2008). The hallmark of these models is that each of them focuses on a small number of reactions pertinent to the metabolic function of interest embedded in a customized representation of the principal pathways of central metabolism. 
HepatoNet1 currently outperforms liver-specific models computationally predicted (Shlomi et al, 2008) on the basis of global reconstructions of human metabolism (Duarte et al, 2007; Ma and Goryanin, 2008). In contrast to either of the aforementioned modeling scales, HepatoNet1 provides the combination of a system-scale representation of metabolic activities and representation of the cell type-specific physical boundaries and their specific transport capacities. This allows for a highly versatile use of the model for the analysis of various liver-specific physiological functions. Conceptually, from a biological system perspective, this type of model offers a large degree of comprehensiveness while retaining tissue specificity, a fundamental design principle of mammalian metabolism. HepatoNet1 is expected to provide a structural platform for computational studies on liver function. The results presented herein highlight how internal fluxes of hepatocyte metabolism and the interplay with systemic physiological parameters can be analyzed with constraint-based modeling techniques. At the same time, the framework may serve as a scaffold for complementation of kinetic and regulatory properties of enzymes and transporters for analysis of sub-networks with topological or kinetic modeling methods.
We present HepatoNet1, the first reconstruction of a comprehensive metabolic network of the human hepatocyte that is shown to accomplish a large canon of known metabolic liver functions. The network comprises 777 metabolites in six intracellular and two extracellular compartments and 2539 reactions, including 1466 transport reactions. It is based on the manual evaluation of >1500 original scientific research publications to warrant a high-quality evidence-based model. The final network is the result of an iterative process of data compilation and rigorous computational testing of network functionality by means of constraint-based modeling techniques. Taking the hepatic detoxification of ammonia as an example, we show how the availability of nutrients and oxygen may modulate the interplay of various metabolic pathways to allow an efficient response of the liver to perturbations of the homeostasis of blood compounds.
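The constraint at the core of such constraint-based testing is the steady-state condition: for a stoichiometric matrix S and a flux vector v, every internal metabolite must be balanced (S·v = 0) while each flux stays within its bounds. The Python sketch below only checks whether a given candidate flux vector is a non-zero stationary distribution. It does not solve the optimization problem of flux-balance analysis, and the three-reaction toy network, the bounds and the function names are assumptions made for illustration, not part of HepatoNet1.

```python
def net_production(stoichiometry, fluxes):
    """Net production rate of each metabolite: one row of S times v per metabolite."""
    return [sum(coeff * v for coeff, v in zip(row, fluxes)) for row in stoichiometry]


def is_stationary(stoichiometry, fluxes, lower, upper, tol=1e-9):
    """True if fluxes are within bounds, balance every metabolite, and are not all zero."""
    within_bounds = all(
        lo - tol <= v <= hi + tol for v, lo, hi in zip(fluxes, lower, upper)
    )
    balanced = all(abs(rate) <= tol for rate in net_production(stoichiometry, fluxes))
    non_zero = any(abs(v) > tol for v in fluxes)
    return within_bounds and balanced and non_zero


# Toy network (hypothetical): uptake -> A, A -> B, B -> export.
# Rows are metabolites A and B; columns are the three reactions.
S = [
    [1, -1, 0],
    [0, 1, -1],
]
lower = [0.0, 0.0, 0.0]
upper = [10.0, 10.0, 10.0]

balanced_flux = [2.0, 2.0, 2.0]    # A and B are both produced and consumed at rate 2
unbalanced_flux = [2.0, 1.0, 1.0]  # A accumulates at rate 1

ok = is_stationary(S, balanced_flux, lower, upper)
not_ok = is_stationary(S, unbalanced_flux, lower, upper)
```

A full flux-balance analysis would search for such a vector by linear programming, typically maximizing or minimizing one flux subject to the same constraints.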
doi:10.1038/msb.2010.62
PMCID: PMC2964118  PMID: 20823849
computational biology; flux balance; liver; minimal flux
17.  DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data 
BMC Bioinformatics  2009;10:37.
Background
Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models.
Results
DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments.
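To make the idea of assigning linguistic labels through fuzzy membership functions concrete, the Python sketch below maps a normalized expression value to the labels Low, Medium and High. The piecewise-linear membership shapes, the [0, 1] scaling and the function names are assumptions chosen for simplicity. The DFP package defines its own membership functions and parameters, which are not reproduced here.

```python
def memberships(x):
    """Degrees of membership in Low / Medium / High for x scaled to [0, 1]."""
    x = min(max(x, 0.0), 1.0)         # clamp to the assumed range
    low = max(0.0, 1.0 - 2.0 * x)     # 1 at x = 0, falls to 0 at x = 0.5
    high = max(0.0, 2.0 * x - 1.0)    # 0 up to x = 0.5, rises to 1 at x = 1
    medium = 1.0 - low - high         # triangular, peaks at x = 0.5
    return {"Low": low, "Medium": medium, "High": high}


def linguistic_label(x):
    """Label with the highest membership degree (ties go to the first label)."""
    degrees = memberships(x)
    return max(degrees, key=degrees.get)


def label_profile(values):
    """Discretize one gene's normalized expression values across samples."""
    return [linguistic_label(v) for v in values]


labels = label_profile([0.1, 0.5, 0.9])
```

A fuzzy pattern for a class could then be built from the genes whose label is consistent across most samples of that class, which is the selection step the abstract describes.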
Conclusion
DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released.
doi:10.1186/1471-2105-10-37
PMCID: PMC2637236  PMID: 19178723
18.  BisoGenet: a new tool for gene network building, visualization and analysis 
BMC Bioinformatics  2010;11:91.
Background
The increasing availability and diversity of omics data in the post-genomic era offers new perspectives in most areas of biomedical research. Graph-based biological network models capture the topology of the functional relationships between molecular entities such as genes, proteins and small compounds and provide a suitable framework for integrating and analyzing omics data. The development of software tools capable of integrating data from different sources and providing flexible methods to reconstruct, represent and analyze topological networks is an active field of research in bioinformatics.
Results
BisoGenet is a multi-tier application for visualization and analysis of biomolecular relationships. The system consists of three tiers. In the data tier, an in-house database stores genomics information, protein-protein interactions, protein-DNA interactions, gene ontology and metabolic pathways. In the middle tier, a global network is created at server startup, representing the whole data on bioentities and their relationships retrieved from the database. The client tier is a Cytoscape plugin, which manages user input, communication with the Web Service, visualization and analysis of the resulting network.
Conclusion
BisoGenet is able to build and visualize biological networks in a fast and user-friendly manner. A feature of BisoGenet is the possibility to include coding relations to distinguish between genes and their products. This feature could be instrumental to achieve a finer grain representation of the bioentities and their relationships. The client application includes network analysis tools and interactive network expansion capabilities. In addition, an option is provided to allow other networks to be converted to BisoGenet. This feature facilitates the integration of our software with other tools available in the Cytoscape platform. BisoGenet is available at http://bio.cigb.edu.cu/bisogenet-cytoscape/.
doi:10.1186/1471-2105-11-91
PMCID: PMC3098113  PMID: 20163717
19.  Posterior Association Networks and Functional Modules Inferred from Rich Phenotypes of Gene Perturbations 
PLoS Computational Biology  2012;8(6):e1002566.
Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (PANs) to predict functional interactions between genes estimated using a Bayesian mixture modelling approach. The major advantage of this approach over conventional hypothesis tests is that prior knowledge can be incorporated to enhance predictive power. We demonstrate in a simulation study and on biological data, that integrating complementary information greatly improves prediction accuracy. To search for significant modules, we perform hierarchical clustering with multiscale bootstrap resampling. We demonstrate the power of the proposed methodologies in applications to Ewing's sarcoma and human adult stem cells using publicly available and custom generated data, respectively. In the former application, we identify a gene module including many confirmed and highly promising therapeutic targets. Genes in the module are also significantly overrepresented in signalling pathways that are known to be critical for proliferation of Ewing's sarcoma cells. In the latter application, we predict a functional network of chromatin factors controlling epidermal stem cell fate. Further examinations using ChIP-seq, ChIP-qPCR and RT-qPCR reveal that the basis of their genetic interactions may arise from transcriptional cross regulation. 
A Bioconductor package implementing PAN is freely available online at http://bioconductor.org/packages/release/bioc/html/PANR.html.
Author Summary
Synthetic genetic interactions estimated from combinatorial gene perturbation screens provide systematic insights into synergistic interactions of genes in a biological process. However, this approach lacks scalability for large-scale genetic interaction profiling in metazoan organisms such as humans. We contribute to this field by proposing a more scalable and affordable approach, which takes advantage of multiple single gene perturbation data to predict coherent functional modules followed by genetic interaction investigation using combinatorial perturbations. We developed a versatile computational framework (PAN) to robustly predict functional interactions and search for significant functional modules from rich phenotyping screens of single gene perturbations under different conditions or from multiple cell lines. PAN features a Bayesian mixture model to assess statistical significance of functional associations, the capability to incorporate prior knowledge as well as a generalized approach to search for significant functional modules by multiscale bootstrap resampling. In applications to Ewing's sarcoma and human adult stem cells, we demonstrate the general applicability and prediction power of PAN to both public and custom generated screening data.
doi:10.1371/journal.pcbi.1002566
PMCID: PMC3386165  PMID: 22761558
20.  RamiGO: an R/Bioconductor package providing an AmiGO Visualize interface 
Bioinformatics  2013;29(5):666-668.
Summary: The R/Bioconductor package RamiGO is an R interface to AmiGO that enables visualization of Gene Ontology (GO) trees. Given a list of GO terms, RamiGO uses the AmiGO visualize API to import Graphviz-DOT format files into R, and export these either as images (SVG, PNG) or into Cytoscape for extended network analyses. RamiGO provides easy customization of annotation, highlighting of specific GO terms, colouring of terms by P-value or export of a simplified summary GO tree. We illustrate RamiGO functionalities in a genome-wide gene set analysis of prognostic genes in breast cancer.
Availability and implementation: RamiGO is provided in R/Bioconductor, is open source under the Artistic-2.0 License and is available with a user manual containing installation, operating instructions and tutorials. It requires R version 2.15.0 or higher. URL: http://bioconductor.org/packages/release/bioc/html/RamiGO.html
Contact: markus.schroeder@ucdconnect.ie
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts708
PMCID: PMC3582261  PMID: 23297033
21.  Scaffold Topologies II: Analysis of Chemical Databases 
We have systematically enumerated graph representations of scaffold topologies for up to 8-ring molecules and 4-valence atoms, thus providing coverage of the lower portion of the chemical space of small molecules (Pollock et al.1). Here, we examine scaffold topology distributions for several databases: ChemNavigator and PubChem for commercially available chemicals, the Dictionary of Natural Products, a set of 2,742 launched drugs, WOMBAT, a database of medicinal chemistry compounds, and two subsets of PubChem, “actives” and DSSTox comprising toxic substances. We also examined a virtual database of exhaustively enumerated small organic molecules, GDB,2 and contrast the scaffold topology distribution from these collections to the complete coverage of up to 8-ring molecules. For reasons related, perhaps, to synthetic accessibility and complexity, scaffolds exhibiting 6 rings or more are poorly represented. Among all collections examined, PubChem has the greatest scaffold topological diversity, whereas GDB is the most limited. More than 50% of all entries (13,000,000+ actual and 13,000,000+ virtual compounds) exhibit only 8 distinct topologies, one of which is the non-scaffold topology that represents all treelike structures. However, most of the topologies are represented by a single or very small number of examples. Within topologies, we found that 3-way scaffold connections (3-nodes) are much more frequent compared to 4-way (4-node) connections. Fused rings have a slightly higher frequency in biologically oriented databases. Scaffold topologies can be the first step toward an efficient coarse-grained classification scheme of the molecules found in chemical databases.
doi:10.1021/ci700342h
PMCID: PMC2807378  PMID: 18605681
22.  THINK Back: KNowledge-based Interpretation of High Throughput data 
BMC Bioinformatics  2012;13(Suppl 2):S4.
Results of high throughput experiments can be challenging to interpret. Current approaches have relied on bulk processing the set of expression levels, in conjunction with easily obtained external evidence, such as co-occurrence. While such techniques can be used to reason probabilistically, they are not designed to shed light on what any individual gene, or a network of genes acting together, may be doing. Our belief is that today we have the information extraction ability and the computational power to perform more sophisticated analyses that consider the individual situation of each gene. The use of such techniques should lead to qualitatively superior results.
The specific aim of this project is to develop computational techniques to generate a small number of biologically meaningful hypotheses based on observed results from high throughput microarray experiments, gene sequences, and next-generation sequences. Through the use of relevant known biomedical knowledge, as represented in published literature and public databases, we can generate meaningful hypotheses that will aid biologists in interpreting their experimental data.
We are currently developing novel approaches that exploit the rich information encapsulated in biological pathway graphs. Our methods perform a thorough and rigorous analysis of biological pathways, using complex factors such as the topology of the pathway graph and the frequency with which genes appear on different pathways, to provide more meaningful hypotheses to describe the biological phenomena captured by high throughput experiments, when compared to other existing methods that only consider partial information captured by biological pathways.
doi:10.1186/1471-2105-13-S2-S4
PMCID: PMC3375631  PMID: 22536867
23.  Information Flow Analysis of Interactome Networks 
PLoS Computational Biology  2009;5(4):e1000350.
Recent studies of cellular networks have revealed modular organizations of genes and proteins. For example, in interactome networks, a module refers to a group of interacting proteins that form molecular complexes and/or biochemical pathways and together mediate a biological process. However, it is still poorly understood how biological information is transmitted between different modules. We have developed information flow analysis, a new computational approach that identifies proteins central to the transmission of biological information throughout the network. In the information flow analysis, we represent an interactome network as an electrical circuit, where interactions are modeled as resistors and proteins as interconnecting junctions. Construing the propagation of biological signals as flow of electrical current, our method calculates an information flow score for every protein. Unlike previous metrics of network centrality such as degree or betweenness that only consider topological features, our approach incorporates confidence scores of protein–protein interactions and automatically considers all possible paths in a network when evaluating the importance of each protein. We apply our method to the interactome networks of Saccharomyces cerevisiae and Caenorhabditis elegans. We find that the likelihood of observing lethality and pleiotropy when a protein is eliminated is positively correlated with the protein's information flow score. Even among proteins of low degree or low betweenness, high information flow scores serve as a strong predictor of loss-of-function lethality or pleiotropy. The correlation between information flow scores and phenotypes supports our hypothesis that the proteins of high information flow reside in central positions in interactome networks. We also show that the ranks of information flow scores are more consistent than those of betweenness when a large amount of noisy data is added to an interactome.
Finally, we combine gene expression data with interaction data in C. elegans and construct an interactome network for muscle-specific genes. We find that genes that rank high in terms of information flow in the muscle interactome network but not in the entire network tend to play important roles in muscle function. This framework for studying tissue-specific networks by the information flow model can be applied to other tissues and other organisms as well.
Author Summary
Protein–protein interactions mediate numerous biological processes. In the last decade, there have been efforts to comprehensively map protein–protein interactions occurring in an organism. The interaction data generated from these high-throughput projects can be represented as interconnected networks. It has been found that knockouts of proteins residing in topologically central positions in the networks are more likely to result in lethality of the organism than knockouts of peripheral proteins. However, it is difficult to accurately define topologically central proteins because high-throughput data is error-prone and some interactions are not as reliable as others. In addition, the architecture of interaction networks varies in different tissues for multi-cellular organisms. To this end, we present a novel computational approach to identify central proteins while considering the confidence of data and gene expression in tissues. Moreover, our approach takes into account multiple alternative paths in interaction networks. We apply our method to yeast and nematode interaction networks. We find that the likelihood of observing lethality and pleiotropy when a given protein is eliminated correlates better with our centrality score for that protein than with its scores based on traditional centrality metrics. Finally, we set up a framework to identify central proteins in tissue-specific interaction networks.
doi:10.1371/journal.pcbi.1000350
PMCID: PMC2685719  PMID: 19503817
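The circuit analogy in the abstract above can be illustrated with a small calculation. The following is a minimal sketch, not the authors' implementation: it assumes each interaction's confidence score is used directly as a conductance, injects a unit current for every pair of proteins, and sums the current passing through each protein (the usual current-flow betweenness convention, with source and sink each counted as carrying the full unit). The function names `solve` and `information_flow` and the toy network are hypothetical, and the graph is assumed to be connected.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def information_flow(nodes, edges):
    """Sum, over all source/sink pairs, the current through each node.

    nodes: list of protein names (connected graph assumed).
    edges: dict {(u, v): conductance}, one entry per undirected interaction.
    """
    score = {v: 0.0 for v in nodes}
    for s, t in combinations(nodes, 2):
        # Ground the sink t and solve the reduced Laplacian for potentials.
        keep = [v for v in nodes if v != t]
        pos = {v: i for i, v in enumerate(keep)}
        lap = [[0.0] * len(keep) for _ in keep]
        for (u, v), g in edges.items():
            for p, q in ((u, v), (v, u)):
                if p != t:
                    lap[pos[p]][pos[p]] += g
                    if q != t:
                        lap[pos[p]][pos[q]] -= g
        rhs = [0.0] * len(keep)
        rhs[pos[s]] = 1.0  # unit current injected at the source
        pot = solve(lap, rhs)
        volt = {v: pot[pos[v]] for v in keep}
        volt[t] = 0.0
        # Ohm's law on every edge; an interior node's throughput is half
        # the sum of absolute edge currents touching it (inflow = outflow).
        touch = {v: 0.0 for v in nodes}
        for (u, v), g in edges.items():
            current = abs(g * (volt[u] - volt[v]))
            touch[u] += current
            touch[v] += current
        for v in nodes:
            score[v] += 1.0 if v in (s, t) else touch[v] / 2.0
    return score


# Toy three-protein chain A - B - C with equal confidence on both edges.
toy_scores = information_flow(["A", "B", "C"], {("A", "B"): 1.0, ("B", "C"): 1.0})
print(toy_scores)  # B, the middle protein, receives the highest score
```

Because every edge current comes from solving the whole circuit, all alternative paths contribute in proportion to their conductance, which is the property the abstract contrasts with shortest-path betweenness.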
24.  A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network 
BMC Bioinformatics  2010;11:343.
Background
Genetic interaction profiles are highly informative and helpful for understanding the functional linkages between genes, and therefore have been extensively exploited for annotating gene functions and dissecting specific pathway structures. However, our understanding of the relationship between double concurrent perturbation and various higher level phenotypic changes, e.g. those in cells, tissues or organs, is rather limited. Modifier screens, such as synthetic genetic arrays (SGA), can help us to understand the phenotype caused by combined gene mutations. Unfortunately, exhaustive tests on all possible combined mutations in any genome are vulnerable to combinatorial explosion and are infeasible either technically or financially. Therefore, an accurate computational approach to predict genetic interaction is highly desirable, and such methods have the potential of alleviating the bottleneck on experiment design.
Results
In this work, we introduce a computational systems biology approach for the accurate prediction of pairwise synthetic genetic interactions (SGI). First, a high-coverage and high-precision functional gene network (FGN) is constructed by integrating protein-protein interaction (PPI), protein complex and gene expression data; then, a graph-based semi-supervised learning (SSL) classifier is utilized to identify SGI, where the topological properties of protein pairs in the weighted FGN are used as input features of the classifier. We compare the proposed SSL method with the state-of-the-art supervised classifier, the support vector machine (SVM), on a benchmark dataset in S. cerevisiae to validate our method's ability to distinguish synthetic genetic interactions from non-interaction gene pairs. Experimental results show that the proposed method can accurately predict genetic interactions in S. cerevisiae (with a sensitivity of 92% and specificity of 91%). Noticeably, the SSL method is more efficient than SVM, especially for very small training sets and large test sets.
Conclusions
We developed a graph-based SSL classifier for predicting the SGI. The classifier employs topological properties of weighted FGN as input features and simultaneously employs information induced from labelled and unlabelled data. Our analysis indicates that the topological properties of weighted FGN can be employed to accurately predict SGI. Also, the graph-based SSL method outperforms the traditional standard supervised approach, especially when used with small training sets. The proposed method can alleviate the experimental burden of exhaustive testing and provide a useful guide for the biologist in narrowing down the candidate gene pairs with SGI. The data and source code implementing the method are available from the website: http://home.ustc.edu.cn/~yzh33108/GeneticInterPred.htm
doi:10.1186/1471-2105-11-343
PMCID: PMC2909217  PMID: 20573270
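To make the idea of graph-based semi-supervised learning concrete, the sketch below shows label propagation with clamped labels, one common scheme in this family. It is an illustrative assumption, not the classifier described in the abstract above: the abstract does not specify the algorithm, and in the paper the items being classified are gene pairs described by topological features. Here the nodes are generic items, the function name `propagate_labels` is hypothetical, and the similarity weights are made up.

```python
def propagate_labels(weights, labels, n_iter=200):
    """Spread known labels over a weighted similarity graph.

    weights: dict node -> dict neighbour -> similarity (symmetric).
    labels:  dict node -> 1.0 (interaction) or 0.0 (non-interaction)
             for the labelled subset only.
    Returns a dict node -> score in [0, 1].
    """
    scores = {v: labels.get(v, 0.5) for v in weights}
    for _ in range(n_iter):
        updated = {}
        for v, nbrs in weights.items():
            if v in labels:
                updated[v] = labels[v]  # labelled nodes stay clamped
                continue
            total = sum(nbrs.values())
            if total == 0:
                updated[v] = scores[v]  # isolated node: nothing to learn from
            else:
                # Weighted average of the neighbours' current scores.
                updated[v] = sum(w * scores[u] for u, w in nbrs.items()) / total
        scores = updated
    return scores


# Toy chain: pos - x - y - neg, with one positive and one negative example.
toy_graph = {
    "pos": {"x": 1.0},
    "x": {"pos": 1.0, "y": 1.0},
    "y": {"x": 1.0, "neg": 1.0},
    "neg": {"y": 1.0},
}
toy_result = propagate_labels(toy_graph, {"pos": 1.0, "neg": 0.0})
print(toy_result)  # x ends up closer to the positive label than y
```

The unlabelled nodes take part in every update, which is how such methods draw on both labelled and unlabelled data when the training set is small.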
25.  A Genomewide Functional Network for the Laboratory Mouse 
PLoS Computational Biology  2008;4(9):e1000165.
Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
Author Summary
Functionally related proteins interact in diverse ways to carry out biological processes, and each protein often participates in multiple pathways. Proteins are therefore organized into a complex network through which different functions of the cell are carried out. An accurate description of such a network is invaluable to our understanding of both the system-level features of a cell and those of an individual biological process. In this study, we used a probabilistic model to combine information from diverse genome-scale studies as well as individual investigations to generate a global functional network for mouse. Our analysis of the global topology of this network reveals biologically relevant systems-level characteristics of the mouse proteome, including conservation of functional neighborhoods and network features characteristic of known disease genes and key transcriptional regulators. We have made this network publicly available for search and dynamic exploration by researchers in the community. Our Web interface enables users to easily generate hypotheses regarding potential functional roles of uncharacterized proteins, investigate possible links between their proteins of interest and disease, and identify new players in specific biological processes.
doi:10.1371/journal.pcbi.1000165
PMCID: PMC2527685  PMID: 18818725
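The clustering coefficient mentioned in the abstract above has a simple definition for an unweighted graph: for a given node, it is the fraction of pairs of its neighbours that are themselves linked. The sketch below computes that standard local coefficient. It is only an illustration under stated assumptions: the mouse network is probabilistic, and the abstract does not say how edge weights or thresholds enter the calculation. The function name `local_clustering` and the toy graph are hypothetical.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Local clustering coefficient of `node` in an undirected graph.

    adj: dict node -> set of neighbouring nodes (symmetric).
    """
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to connect
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * linked / (k * (k - 1))


# Toy graph: triangle a-b-c plus a pendant node d attached to a.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for name in sorted(toy_adj):
    print(name, local_clustering(toy_adj, name))
```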
