Motivation: KEGG PATHWAY is a service of Kyoto Encyclopedia of Genes and Genomes (KEGG), constructing manually curated pathway maps that represent current knowledge on biological networks in graph models. While valuable graph tools have been implemented in R/Bioconductor, to our knowledge there is currently no software package to parse and analyze KEGG pathways with graph theory.
Results: We introduce the software package KEGGgraph in R and Bioconductor, an interface between KEGG pathways and graph models as well as a collection of tools for these graphs. Superior to existing approaches, KEGGgraph captures the pathway topology and allows further analysis or dissection of pathway graphs. We demonstrate the use of the package by the case study of analyzing human pancreatic cancer pathway.
Availability:KEGGgraph is freely available at the Bioconductor web site (http://www.bioconductor.org). KGML files can be downloaded from KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/xml).
Supplementary information: Supplementary data are available at Bioinformatics online.
The increasing availability and diversity of omics data in the post-genomic era offers new perspectives in most areas of biomedical research. Graph-based biological networks models capture the topology of the functional relationships between molecular entities such as gene, protein and small compounds and provide a suitable framework for integrating and analyzing omics-data. The development of software tools capable of integrating data from different sources and to provide flexible methods to reconstruct, represent and analyze topological networks is an active field of research in bioinformatics.
BisoGenet is a multi-tier application for visualization and analysis of biomolecular relationships. The system consists of three tiers. In the data tier, an in-house database stores genomics information, protein-protein interactions, protein-DNA interactions, gene ontology and metabolic pathways. In the middle tier, a global network is created at server startup, representing the whole data on bioentities and their relationships retrieved from the database. The client tier is a Cytoscape plugin, which manages user input, communication with the Web Service, visualization and analysis of the resulting network.
BisoGenet is able to build and visualize biological networks in a fast and user-friendly manner. A feature of Bisogenet is the possibility to include coding relations to distinguish between genes and their products. This feature could be instrumental to achieve a finer grain representation of the bioentities and their relationships. The client application includes network analysis tools and interactive network expansion capabilities. In addition, an option is provided to allow other networks to be converted to BisoGenet. This feature facilitates the integration of our software with other tools available in the Cytoscape platform. BisoGenet is available at http://bio.cigb.edu.cu/bisogenet-cytoscape/.
This paper presents the R/Bioconductor package minet (version 1.1.6) which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves in order to compare the inferred network with a reference one.
The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
Visualization and analysis of molecular networks are both central to systems biology. However, there still exists a large technological gap between them, especially when assessing multiple network levels or hierarchies. Here we present RedeR, an R/Bioconductor package combined with a Java core engine for representing modular networks. The functionality of RedeR is demonstrated in two different scenarios: hierarchical and modular organization in gene co-expression networks and nested structures in time-course gene expression subnetworks. Our results demonstrate RedeR as a new framework to deal with the multiple network levels that are inherent to complex biological systems. RedeR is available from http://bioconductor.org/packages/release/bioc/html/RedeR.html.
Objectives: The aim of present study was to find genetic pathways activated during infection with bacterial meningitis (BM) and potentially influencing the course of the infection using genome-wide RNA expression profiling combined with pathway analysis and functional annotation of the differential transcription.
Methods: We analyzed 21 patients with BM hospitalized in 2008. The control group consisted of 18 healthy subjects. The RNA was extracted from whole blood, globin mRNA was depleted and gene expression profiling was performed using GeneChip Human Gene 1.0 ST Arrays which can assess the transcription of 28,869 genes. Gene expression profile data were analyzed using Bioconductor packages and Bayesian modeling. Functional annotation of the enriched gene sets was used to define the altered genetic networks. We also analyzed whether gene expression profiles depend on the clinical course and outcome. In order to verify the microarray results, the expression levels of ten functionally relevant genes with high statistical significance (CD177, IL1R2, IL18R1, IL18RAP, OLFM4, TLR5, CPA3, FCER1A, IL5RA, and IL7R) were confirmed by quantitative real-time (qRT) PCR.
Results: There were 8569 genes displaying differential expression at a significance level of p < 0.05. Following False Discovery Rate (FDR) correction, a total of 5500 genes remained significant at a p-value of < 0.01. Quantitative RT-PCR confirmed the differential expression in 10 selected genes. Functional annotation and network analysis indicated that most of the genes were related to activation of humoral and cellular immune responses (enrichment score 43). Those changes were found in both adults and in children with BM compared to the healthy controls. The gene expression profiles did not significantly depend on the clinical outcome, but there was a strong influence of the specific type of pathogen underlying BM.
Conclusion: This study demonstrates that there is a very strong activation of immune response at the transcriptional level during BM and that the type of pathogen influences this transcriptional activation.
bacterial meningitis; gene expression profiling; gene networks
The ultimate aim of systems biology is to understand and describe how molecular components interact to manifest collective behaviour that is the sum of the single parts. Building a network of molecular interactions is the basic step in modelling a complex entity such as the cell. Even if gene-gene interactions only partially describe real networks because of post-transcriptional modifications and protein regulation, using microarray technology it is possible to combine measurements for thousands of genes into a single analysis step that provides a picture of the cell's gene expression. Several databases provide information about known molecular interactions and various methods have been developed to infer gene networks from expression data. However, network topology alone is not enough to perform simulations and predictions of how a molecular system will respond to perturbations. Rules for interactions among the single parts are needed for a complete definition of the network behaviour. Another interesting question is how to integrate information carried by the network topology, which can be derived from the literature, with large-scale experimental data.
Here we propose an algorithm, called inference of regulatory interaction schema (IRIS), that uses an iterative approach to map gene expression profile values (both steady-state and time-course) into discrete states and a simple probabilistic method to infer the regulatory functions of the network. These interaction rules are integrated into a factor graph model. We test IRIS on two synthetic networks to determine its accuracy and compare it to other methods. We also apply IRIS to gene expression microarray data for the Saccharomyces cerevisiae cell cycle and for human B-cells and compare the results to literature findings.
IRIS is a rapid and efficient tool for the inference of regulatory relations in gene networks. A topological description of the network and a matrix of gene expression profiles are required as input to the algorithm. IRIS maps gene expression data onto discrete values and then computes regulatory functions as conditional probability tables. The suitability of the method is demonstrated for synthetic data and microarray data. The resulting network can also be embedded in a factor graph model.
Complex genetic disorders often involve products of multiple genes acting cooperatively. Hence, the pathophenotype is the outcome of the perturbations in the underlying pathways, where gene products cooperate through various mechanisms such as protein-protein interactions. Pinpointing the decisive elements of such disease pathways is still challenging. Over the last years, computational approaches exploiting interaction network topology have been successfully applied to prioritize individual genes involved in diseases. Although linkage intervals provide a list of disease-gene candidates, recent genome-wide studies demonstrate that genes not associated with any known linkage interval may also contribute to the disease phenotype. Network based prioritization methods help highlighting such associations. Still, there is a need for robust methods that capture the interplay among disease-associated genes mediated by the topology of the network. Here, we propose a genome-wide network-based prioritization framework named GUILD. This framework implements four network-based disease-gene prioritization algorithms. We analyze the performance of these algorithms in dozens of disease phenotypes. The algorithms in GUILD are compared to state-of-the-art network topology based algorithms for prioritization of genes. As a proof of principle, we investigate top-ranking genes in Alzheimer's disease (AD), diabetes and AIDS using disease-gene associations from various sources. We show that GUILD is able to significantly highlight disease-gene associations that are not used a priori. Our findings suggest that GUILD helps to identify genes implicated in the pathology of human disorders independent of the loci associated with the disorders.
attract is a knowledge-driven analytical approach for identifying and annotating the gene-sets that best discriminate between cell phenotypes. attract finds distinguishing patterns within pathways, decomposes pathways into meta-genes representative of these patterns, and then generates synexpression groups of highly correlated genes from the entire transcriptome dataset. attract can be applied to a wide range of biological systems and is freely available as a Bioconductor package and has been incorporated into the MeV software system.
The nuclear receptors are a large family of eukaryotic transcription factors that constitute major pharmacological targets. They exert their combinatorial control through homotypic heterodimerisation. Elucidation of this dimerisation network is vital in order to understand the complex dynamics and potential cross-talk involved.
Phylogeny, protein-protein interactions, protein-DNA interactions and gene expression data have been integrated to provide a comprehensive and up-to-date description of the topology and properties of the nuclear receptor interaction network in humans. We discriminate between DNA-binding and non-DNA-binding dimers, and provide a comprehensive interaction map, that identifies potential cross-talk between the various pathways of nuclear receptors.
We infer that the topology of this network is hub-based, and much more connected than previously thought. The hub-based topology of the network and the wide tissue expression pattern of NRs create a highly competitive environment for the common heterodimerising partners. Furthermore, a significant number of negative feedback loops is present, with the hub protein SHP [NR0B2] playing a major role. We also compare the evolution, topology and properties of the nuclear receptor network with the hub-based dimerisation network of the bHLH transcription factors in order to identify both unique themes and ubiquitous properties in gene regulation. In terms of methodology, we conclude that such a comprehensive picture can only be assembled by semi-automated text-mining, manual curation and integration of data from various sources.
Summary: TopoGSA (Topology-based Gene Set Analysis) is a web-application dedicated to the computation and visualization of network topological properties for gene and protein sets in molecular interaction networks. Different topological characteristics, such as the centrality of nodes in the network or their tendency to form clusters, can be computed and compared with those of known cellular pathways and processes.
Availability: Freely available at http://www.infobiotics.net/topogsa
Contact: firstname.lastname@example.org; email@example.com
In prokaryotic genomes the number of transcriptional regulators is known to be
proportional to the square of the total number of protein-coding genes. A
toolbox model of evolution was recently proposed to explain this empirical
scaling for metabolic enzymes and their regulators. According to its rules, the
metabolic network of an organism evolves by horizontal transfer of pathways from
other species. These pathways are part of a larger “universal”
network formed by the union of all species-specific networks. It remained to be
understood, however, how the topological properties of this universal network
influence the scaling law of functional content of genomes in the toolbox model.
Here we answer this question by first analyzing the scaling properties of the
toolbox model on arbitrary tree-like universal networks. We prove that critical
branching topology, in which the average number of upstream neighbors of a node
is equal to one, is both necessary and sufficient for quadratic scaling. We
further generalize the rules of the model to incorporate reactions with multiple
substrates/products as well as branched and cyclic metabolic pathways. To
achieve its metabolic tasks, the new model employs evolutionary optimized
pathways with minimal number of reactions. Numerical simulations of this
realistic model on the universal network of all reactions in the KEGG database
produced approximately quadratic scaling between the number of regulated
pathways and the size of the metabolic network. To quantify the geometrical
structure of individual pathways, we investigated the relationship between their
number of reactions, byproducts, intermediate, and feedback metabolites. Our
results validate and explain the ubiquitous appearance of the quadratic scaling
for a broad spectrum of topologies of underlying universal metabolic networks.
They also demonstrate why, in spite of “small-world” topology,
real-life metabolic networks are characterized by a broad distribution of
pathway lengths and sizes of metabolic regulons in regulatory networks.
It has been previously reported that in prokaryotic genomes the number of
transcriptional regulators is proportional to the square of the total number of
genes. We recently offered a general explanation of this empirical powerlaw
scaling in terms of the “toolbox” model in which metabolic and
regulatory networks co-evolve together. This evolution is driven by horizontal
gene transfer of co-regulated metabolic pathways from other species. These
pathways are part of a larger “universal” network formed by the
union of all species-specific networks. In the present work we address the
question of how topological properties of this universal network influence the
powerlaw scaling of regulators in the toolbox model. We also generalize its
rules to include reactions with multiple substrates and products, branched and
cyclic metabolic pathways, and to account for optimality of metabolic pathways.
The main conclusion of our analytical and numerical modeling efforts is that the
quadratic scaling is the robust feature of the toolbox model in a broad range of
universal network topologies. They also demonstrate why, in spite of
“small-world” topology, real-life metabolic networks are
characterized by a broad distribution of pathway lengths and sizes of regulons
in regulatory networks.
By integrating data from comparative genomics and large-scale deletion studies, we previously proposed a minimal gene set comprising 206 protein-coding genes. To evaluate the consistency of the metabolism encoded by such a minimal genome, we have carried out a series of computational analyses. Firstly, the topology of the minimal metabolism was compared with that of the reconstructed networks from natural bacterial genomes. Secondly, the robustness of the metabolic network was evaluated by simulated mutagenesis and, finally, the stoichiometric consistency was assessed by automatically deriving the steady-state solutions from the reaction set. The results indicated that the proposed minimal metabolism presents stoichiometric consistency and that it is organized as a complex power-law network with topological parameters falling within the expected range for a natural metabolism of its size. The robustness analyses revealed that most random mutations do not alter the topology of the network significantly, but do cause significant damage by preventing the synthesis of several compounds or compromising the stoichiometric consistency of the metabolism. The implications that these results have on the origins of metabolic complexity and the theoretical design of an artificial minimal cell are discussed.
minimal genome; metabolic inference; elementary flux mode; scale-free network; network topology; power law
Nested effects models (NEMs) are a class of probabilistic models that were designed to reconstruct a hidden signalling structure from a large set of observable effects caused by active interventions into the signalling pathway. We give a more flexible formulation of NEMs in the language of Bayesian networks. Our framework constitutes a natural generalization of the original NEM model, since it explicitly states the assumptions that are tacitly underlying the original version. Our approach gives rise to new learning methods for NEMs, which have been implemented in the /Bioconductor package nem. We validate these methods in a simulation study and apply them to a synthetic lethality dataset in yeast.
Complex networks abound in physical, biological and social sciences. Quantifying a network’s topological structure facilitates network exploration and analysis, and network comparison, clustering and classification. A number of Wiener type indices have recently been incorporated as distance-based descriptors of complex networks, such as the R package QuACN. Wiener type indices are known to depend both on the network’s number of nodes and topology. To apply these indices to measure similarity of networks of different numbers of nodes, normalization of these indices is needed to correct the effect of the number of nodes in a network. This paper aims to fill this gap. Moreover, we introduce an -Wiener index of network , denoted by . This notion generalizes the Wiener index to a very wide class of Wiener type indices including all known Wiener type indices. We identify the maximum and minimum of over a set of networks with nodes. We then introduce our normalized-version of -Wiener index. The normalized -Wiener indices were demonstrated, in a number of experiments, to improve significantly the hierarchical clustering over the non-normalized counterparts.
Recently, computational approaches integrating copy number aberrations (CNAs) and gene expression (GE) have been extensively studied to identify cancer-related genes and pathways. In this work, we integrate these two data sets with protein-protein interaction (PPI) information to find cancer-related functional modules. To integrate CNA and GE data, we first built a gene-gene relationship network from a set of seed genes by enumerating all types of pairwise correlations, e.g. GE-GE, CNA-GE, and CNA-CNA, over multiple patients. Next, we propose a voting-based cancer module identification algorithm by combining topological and data-driven properties (VToD algorithm) by using the gene-gene relationship network as a source of data-driven information, and the PPI data as topological information. We applied the VToD algorithm to 266 glioblastoma multiforme (GBM) and 96 ovarian carcinoma (OVC) samples that have both expression and copy number measurements, and identified 22 GBM modules and 23 OVC modules. Among 22 GBM modules, 15, 12, and 20 modules were significantly enriched with cancer-related KEGG, BioCarta pathways, and GO terms, respectively. Among 23 OVC modules, 19, 18, and 23 modules were significantly enriched with cancer-related KEGG, BioCarta pathways, and GO terms, respectively. Similarly, we also observed that 9 and 2 GBM modules and 15 and 18 OVC modules were enriched with cancer gene census (CGC) and specific cancer driver genes, respectively. Our proposed module-detection algorithm significantly outperformed other existing methods in terms of both functional and cancer gene set enrichments. Most of the cancer-related pathways from both cancer data sets found in our algorithm contained more than two types of gene-gene relationships, showing strong positive correlations between the number of different types of relationship and CGC enrichment -values (0.64 for GBM and 0.49 for OVC). This study suggests that identified modules containing both expression changes and CNAs can explain cancer-related activities with greater insights.
Microarray data repositories as well as large clinical applications of gene expression allow to analyse several hundreds of microarrays at one time. The preprocessing of large amounts of microarrays is still a challenge. The algorithms are limited by the available computer hardware. For example, building classification or prognostic rules from large microarray sets will be very time consuming. Here, preprocessing has to be a part of the cross-validation and resampling strategy which is necessary to estimate the rule’s prediction quality honestly.
This paper proposes the new Bioconductor package affyPara for parallelized preprocessing of Affymetrix microarray data. Partition of data can be applied on arrays and parallelization of algorithms is a straightforward consequence. The partition of data and distribution to several nodes solves the main memory problems and accelerates preprocessing by up to the factor 20 for 200 or more arrays.
affyPara is a free and open source package, under GPL license, available form the Bioconductor project at www.bioconductor.org. A user guide and examples are provided with the package.
parallel computing; R; microarray; preprocessing; normalization
Network methods are increasingly used to represent the interactions of genes and/or proteins. Genes or proteins that are directly linked may have a similar biological function or may be part of the same biological pathway. Since the information on the connection (adjacency) between 2 nodes may be noisy or incomplete, it can be desirable to consider alternative measures of pairwise interconnectedness. Here we study a class of measures that are proportional to the number of neighbors that a pair of nodes share in common. For example, the topological overlap measure by Ravasz et al.  can be interpreted as a measure of agreement between the m = 1 step neighborhoods of 2 nodes. Several studies have shown that two proteins having a higher topological overlap are more likely to belong to the same functional class than proteins having a lower topological overlap. Here we address the question whether a measure of topological overlap based on higher-order neighborhoods could give rise to a more robust and sensitive measure of interconnectedness.
We generalize the topological overlap measure from m = 1 step neighborhoods to m ≥ 2 step neighborhoods. This allows us to define the m-th order generalized topological overlap measure (GTOM) by (i) counting the number of m-step neighbors that a pair of nodes share and (ii) normalizing it to take a value between 0 and 1. Using theoretical arguments, a yeast co-expression network application, and a fly protein network application, we illustrate the usefulness of the proposed measure for module detection and gene neighborhood analysis.
Topological overlap can serve as an important filter to counter the effects of spurious or missing connections between network nodes. The m-th order topological overlap measure allows one to trade-off sensitivity versus specificity when it comes to defining pairwise interconnectedness and network modules.
Biological networks tend to have high interconnectivity, complex topologies and multiple types of interactions. This renders difficult the identification of sub-networks that are involved in condition- specific responses. In addition, we generally lack scalable methods that can reveal the information flow in gene regulatory and biochemical pathways. Doing so will help us to identify key participants and paths under specific environmental and cellular context.
This paper introduces the theory of network flooding, which aims to address the problem of network minimization and regulatory information flow in gene regulatory networks. Given a regulatory biological network, a set of source (input) nodes and optionally a set of sink (output) nodes, our task is to find (a) the minimal sub-network that encodes the regulatory program involving all input and output nodes and (b) the information flow from the source to the sink nodes of the network. Here, we describe a novel, scalable, network traversal algorithm and we assess its potential to achieve significant network size reduction in both synthetic and E. coli networks. Scalability and sensitivity analysis show that the proposed method scales well with the size of the network, and is robust to noise and missing data.
The method of network flooding proves to be a useful, practical approach towards information flow analysis in gene regulatory networks. Further extension of the proposed theory has the potential to lead in a unifying framework for the simultaneous network minimization and information flow analysis across various “omics” levels.
Network flood; Network flux; Information flow; Gene regulatory networks; Network minimization
Cellular processes and pathways, whose deregulation may contribute to the development of cancers, are often represented as cascades of proteins transmitting a signal from the cell surface to the nucleus. However, recent functional genomic experiments have identified thousands of interactions for the signalling canonical proteins, challenging the traditional view of pathways as independent functional entities. Combining information from pathway databases and interaction networks obtained from functional genomic experiments is therefore a promising strategy to obtain more robust pathway and process representations, facilitating the study of cancer-related pathways.
We present a methodology for extending pre-defined protein sets representing cellular pathways and processes by mapping them onto a protein-protein interaction network, and extending them to include densely interconnected interaction partners. The added proteins display distinctive network topological features and molecular function annotations, and can be proposed as putative new components, and/or as regulators of the communication between the different cellular processes. Finally, these extended pathways and processes are used to analyse their enrichment in pancreatic mutated genes. Significant associations between mutated genes and certain processes are identified, enabling an analysis of the influence of previously non-annotated cancer mutated genes.
The proposed method for extending cellular pathways helps to explain the functions of cancer mutated genes by exploiting the synergies of canonical knowledge and large-scale interaction data.
Many signaling systems show adaptation—the ability to reset themselves after responding to a stimulus. We computationally searched all possible three-node enzyme network topologies to identify those that could perform adaptation. Only two major core topologies emerge as robust solutions: a negative feedback loop with a buffering node and an incoherent feedforward loop with a proportioner node. Minimal circuits containing these topologies are, within proper regions of parameter space, sufficient to achieve adaptation. Morecomplex circuits that robustly performadaptation all contain at least one of these topologies at their core. This analysis yields a design table highlighting a finite set of adaptive circuits. Despite the diversity of possible biochemical networks, it may be common to find that only a finite set of core topologies can execute a particular function. These design rules provide a framework for functionally classifying complex natural networks and a manual for engineering networks.
For a video summary of this article, see the PaperFlick file with the Supplemental Data available online.
Summary: Coordinated Gene Activity in Pattern Sets (CoGAPS) provides an integrated package for isolating gene expression driven by a biological process, enhancing inference of biological processes from transcriptomic data. CoGAPS improves on other enrichment measurement methods by combining a Markov chain Monte Carlo (MCMC) matrix factorization algorithm (GAPS) with a threshold-independent statistic inferring activity on gene sets. The software is provided as open source C++ code built on top of JAGS software with an R interface.
Availability: The R package CoGAPS and the C++ package GAPS-JAGS are provided open source under the GNU Lesser Public License (GLPL) with a users manual containing installation and operating instructions. CoGAPS is available through Bioconductor and depends on the rjags package available through CRAN to interface CoGAPS with GAPS-JAGS.
Contact: firstname.lastname@example.org; email@example.com
Supplementary Information: Supplementary data is available at Bioinformatics online.
Deregulation between two different cell populations manifests itself in changing gene expression patterns and changing regulatory interactions. Accumulating knowledge about biological networks creates an opportunity to study these changes in their cellular context.
We analyze re-wiring of regulatory networks based on cell population-specific perturbation data and knowledge about signaling pathways and their target genes. We quantify deregulation by merging regulatory signal from the two cell populations into one score. This joint approach, called JODA, proves advantageous over separate analysis of the cell populations and analysis without incorporation of knowledge. JODA is implemented and freely available in a Bioconductor package 'joda'.
Using JODA, we show wide-spread re-wiring of gene regulatory networks upon neocarzinostatin-induced DNA damage in Human cells. We recover 645 deregulated genes in thirteen functional clusters performing the rich program of response to damage. We find that the clusters contain many previously characterized neocarzinostatin target genes. We investigate connectivity between those genes, explaining their cooperation in performing the common functions. We review genes with the most extreme deregulation scores, reporting their involvement in response to DNA damage. Finally, we investigate the indirect impact of the ATM pathway on the deregulated genes, and build a hypothetical hierarchy of direct regulation. These results prove that JODA is a step forward to a systems level, mechanistic understanding of changes in gene regulation between different cell populations.
Gene-list annotations are critical for researchers to explore the complex relationships between genes and functionalities. Currently, the annotations of a gene list are usually summarized by a table or a barplot. As such, potentially biologically important complexities such as one gene belonging to multiple annotation categories are difficult to extract. We have devised explicit and efficient visualization methods that provide intuitive methods for interrogating the intrinsic connections between biological categories and genes.
We have constructed a data model and now present two novel methods in a Bioconductor package, "GeneAnswers", to simultaneously visualize genes, concepts (a.k.a. annotation categories), and concept-gene connections (a.k.a. annotations): the "Concept-and-Gene Network" and the "Concept-and-Gene Cross Tabulation". These methods have been tested and validated with microarray-derived gene lists.
These new visualization methods can effectively present annotations using Gene Ontology, Disease Ontology, or any other user-defined gene annotations that have been pre-associated with an organism's genome by human curation, automated pipelines, or a combination of the two. The gene-annotation data model and associated methods are available in the Bioconductor package called "GeneAnswers " described in this publication.
The Bioconductor project is an “open source and open development software project for the analysis and comprehension of genomic data” Gentleman et al. (2004), primarily based on the
R programming language. Infrastructure packages, such as Biobase, are maintained by Bioconductor core developers, and serve several key roles to the broader community of Bioconductor software developers and users. In particular, Biobase introduces a
S4 class, the
eSet, for high dimensional assay data. Encapsulating the assay data as well as meta-data on the samples, the features, and the experiment in the
eSet class definition ensures propagation of the relevant sample and feature meta-data throughout an analysis. Extending the
eSet class promotes code reuse through inheritance, interoperability with other
R packages, and is less error prone. Recently proposed class definitions for high throughput SNP arrays extend the
eSet class. This chapter highlights the advantages of adopting and extending Biobase class definitions through a working example of one implementation of classes for the analysis of high throughput SNP arrays.
SNP array; copy number; genotype; S4 classes
A metabolic network is the sum of all chemical transformations or reactions in the cell, with the metabolites being interconnected by enzyme-catalyzed reactions. Many enzymes exist in numerous species while others occur only in a few. We ask if there are relationships between the phylogenetic profile of an enzyme, or the number of different bacterial species that contain it, and its topological importance in the metabolic network. Our null hypothesis is that phylogenetic profile is independent of topological importance. To test our null hypothesis we constructed an enzyme network from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. We calculated three network indices of topological importance: the degree or the number of connections of a network node; closeness centrality, which measures how close a node is to others; and betweenness centrality measuring how frequently a node appears on all shortest paths between two other nodes.
Enzyme phylogenetic profile correlates best with betweenness centrality and also quite closely with degree, but poorly with closeness centrality. Both betweenness and closeness centralities are non-local measures of topological importance and it is intriguing that they have contrasting power of predicting phylogenetic profile in bacterial species. We speculate that redundancy in an enzyme network may be reflected by betweenness centrality but not by closeness centrality. We also discuss factors influencing the correlation between phylogenetic profile and topological importance.
Our analysis falsifies the hypothesis that phylogenetic profile of enzymes is independent of enzyme network importance. Our results show that phylogenetic profile correlates better with degree and betweenness centrality, but less so with closeness centrality. Enzymes that occur in many bacterial species tend to be those that have high network importance. We speculate that this phenomenon originates in mechanisms driving network evolution. Closeness centrality reflects phylogenetic profile poorly. This is because metabolic networks often consist of distinct functional modules and some are not in the centre of the network. Enzymes in these peripheral parts of a network might be important for cell survival and should therefore occur in many bacterial species. They are, however, distant from other enzymes in the same network.