Large-scale evaluation of gene expression variation among Caenorhabditis elegans lines that have diverged from a common ancestor allows for the analysis of a novel class of biological networks – evolutionary gene coexpression networks. Comparative analysis of these evolutionary networks has the potential to uncover the effects of natural selection in shaping coexpression network topologies since C. elegans mutation accumulation (MA) lines evolve essentially free from the effects of natural selection, whereas natural isolate (NI) populations are subject to selective constraints.
We compared evolutionary gene coexpression networks for C. elegans MA lines versus NI populations to evaluate the role that natural selection plays in shaping the evolution of network topologies. MA and NI evolutionary gene coexpression networks were found to have very similar global topological properties as measured by a number of network topological parameters. Observed MA and NI networks show node degree distributions and average values for node degree, clustering coefficient, path length, eccentricity and betweeness that are statistically indistinguishable from one another yet highly distinct from randomly simulated networks. On the other hand, at the local level the MA and NI coexpression networks are highly divergent; pairs of genes coexpressed in the MA versus NI lines are almost entirely different as are the connectivity and clustering properties of individual genes.
It appears that selective forces shape how local patterns of coexpression change over time but do not control the global topology of C. elegans evolutionary gene coexpression networks. These results have implications for the evolutionary significance of global network topologies, which are known to be conserved across disparate complex systems.
While gene duplication is known to be one of the most common mechanisms of genome evolution, the fates of genes after duplication are still being debated. In particular, it is presently unknown whether most duplicate genes preserve (or subdivide) the functions of the parental gene or acquire new functions. One aspect of gene function, that is the expression profile in gene coexpression network, has been largely unexplored for duplicate genes.
Here we build a human gene coexpression network using human tissue-specific microarray data and investigate the divergence of duplicate genes in it. The topology of this network is scale-free. Interestingly, our analysis indicates that duplicate genes rapidly lose shared coexpressed partners: after approximately 50 million years since duplication, the two duplicate genes in a pair have only slightly higher number of shared partners as compared with two random singletons. We also show that duplicate gene pairs quickly acquire new coexpressed partners: the average number of partners for a duplicate gene pair is significantly greater than that for a singleton (the latter number can be used as a proxy of the number of partners for a parental singleton gene before duplication). The divergence in gene expression between two duplicates in a pair occurs asymmetrically: one gene usually has more partners than the other one. The network is resilient to both random and degree-based in silico removal of either singletons or duplicate genes. In contrast, the network is especially vulnerable to the removal of highly connected genes when duplicate genes and singletons are considered together.
Duplicate genes rapidly diverge in their expression profiles in the network and play similar role in maintaining the network robustness as compared with singletons.
Supplementary information: Please see additional files.
Identification of genes that are coexpressed across various tissues and environmental stresses is biologically interesting, since they
may play coordinated role in similar biological processes. Genes with correlated expression patterns can be best identified by using
coexpression network analysis of transcriptome data. In the present study, we analyzed the temporal-spatial coordination of gene
expression in root, leaf and panicle of rice under drought stress and constructed network using WGCNA and Cytoscape. Total of
2199 differentially expressed genes (DEGs) were identified in at least three or more tissues, wherein 88 genes have coordinated
expression profile among all the six tissues under drought stress. These 88 highly coordinated genes were further subjected to
module identification in the coexpression network. Based on chief topological properties we identified 18 hub genes such as ABC
transporter, ATP-binding protein, dehydrin, protein phosphatase 2C, LTPL153 - Protease inhibitor, phosphatidylethanolaminebinding
protein, lactose permease-related, NADP-dependent malic enzyme, etc. Motif enrichment analysis showed the presence of
ABRE cis-elements in the promoters of > 62% of the coordinately expressed genes. Our results suggest that drought stress mediated
upregulated gene expression was coordinated through an ABA-dependent signaling pathway across tissues, at least for the subset
of genes identified in this study, while down regulation appears to be regulated by tissue specific pathways in rice.
Coexpression; Drought stress; Hub gene; Rice; Transcriptome; WGCNA
Toward the goal of understanding system properties of biological networks, we investigate the global and local regulation of gene expression in the Saccharomyces cerevisiae metabolic network. Our results demonstrate predominance of local gene regulation in metabolism. Metabolic genes display significant coexpression on distances smaller than the average network distance, a behavior supported by the distribution of transcription factor binding sites in the metabolic network and genome context associations. Positive gene coexpression decreases monotonically with distance in the network, while negative coexpression is strongest at intermediate network distances. We show that basic topological motifs of the metabolic network exhibit statistically significant differences in coexpression behavior.
expression; genome context; metabolism; motifs; network
Publicly available databases of coexpressed gene sets are a valuable resource for a wide variety of experimental studies, including gene targeting for functional identification, and for investigations of regulatory mechanisms or protein–protein interaction networks. Although coexpressed gene databases are becoming more and more popular in the field of plant biology, those with animal data are rather limited, possibly due to the lower reliability of the coexpression data. The original COXPRESdb (coexpressed gene database) (http://coxpresdb.jp) represented the coexpression relationship for human and mouse. Here, we report updates of this database that especially focus on the enhancement of the reliability of gene coexpression data in animals. For this purpose, we implemented a new comparable coexpression measure, Mutual Rank, included five other animal species, rat, chicken, zebrafish, fly and nematoda, to assess the conservation of coexpression, and added different layers of omics data into the integrated network of genes. Comparison of coexpression is a key concept to enhance the reliability of gene coexpression, and the integration of different information can reduce the noise inherent in the information. With the functions for gene network representation, COXPRESdb can help researchers to clarify the functional and regulatory networks of genes in a broad array of animal species.
The complex organization of connectivity in the human brain is incompletely understood. Recently, topological measures based on graph theory have provided a new approach to quantify large-scale cortical networks. These methods have been applied to anatomical connectivity data on non-human species and cortical networks have been shown to have small-world topology, associated with high local and global efficiency of information transfer. Anatomical networks derived from cortical thickness measurements have shown the same organizational properties of the healthy human brain, consistent with similar results reported in functional networks derived from resting state functional MRI and MEG data. Here we show, using anatomical networks derived from analysis of inter-regional covariation of gray matter volume in magnetic resonance imaging (MRI) data on 259 healthy volunteers, that classical divisions of cortex (multimodal, unimodal and transmodal) have some distinct topological attributes. While all cortical divisions shared non-random properties of small-worldness and efficient wiring (short mean Euclidean distance between connected regions), the multimodal network had a hierarchical organization, dominated by frontal hubs with low clustering, whereas the transmodal network was assortative. Moreover, in a sample of 203 people with schizophrenia, multimodal network organization was abnormal, as indicated by reduced hierarchy, the loss of frontal and the emergence of non-frontal hubs, and increased connection distance. We propose that the topological differences between divisions of normal cortex may represent the outcome of different growth processes for multimodal and transmodal networks; and that neurodevelopmental abnormalities in schizophrenia specifically impact multimodal cortical organization.
anatomy; network; hierarchy; systems; MRI; schizophrenia; neurodevelopment
Network analysis techniques allow a more accurate reflection of underlying systems biology to be realized than traditional unidimensional molecular biology approaches. Here, using gene coexpression network analysis, we define the gene expression network topology of cardiac hypertrophy and failure and the extent of recapitulation of fetal gene expression programs in failing and hypertrophied adult myocardium.
Methods and Results
We assembled all myocardial transcript data in the Gene Expression Omnibus (n = 1617). Since hierarchical analysis revealed species had primacy over disease clustering, we focused this analysis on the most complete (murine) dataset (n = 478). Using gene coexpression network analysis, we derived functional modules, regulatory mediators and higher order topological relationships between genes and identified 50 gene co-expression modules in developing myocardium that were not present in normal adult tissue. We found that known gene expression markers of myocardial adaptation were members of upregulated modules but not hub genes. We identified ZIC2 as a novel transcription factor associated with coexpression modules common to developing and failing myocardium. Of 50 fetal gene co-expression modules, three (6%) were reproduced in hypertrophied myocardium and seven (14%) were reproduced in failing myocardium. One fetal module was common to both failing and hypertrophied myocardium.
Network modeling allows systems analysis of cardiovascular development and disease. While we did not find evidence for a global coordinated program of fetal gene expression in adult myocardial adaptation, our analysis revealed specific gene expression modules active during both development and disease and specific candidates for their regulation.
fetal; gene expression; heart failure; hypertrophy; myocardium
A database of coexpressed gene sets can provide valuable information for a wide variety of experimental designs, such as targeting of genes for functional identification, gene regulation and/or protein–protein interactions. Coexpressed gene databases derived from publicly available GeneChip data are widely used in Arabidopsis research, but platforms that examine coexpression for higher mammals are rather limited. Therefore, we have constructed a new database, COXPRESdb (coexpressed gene database) (http://coxpresdb.hgc.jp), for coexpressed gene lists and networks in human and mouse. Coexpression data could be calculated for 19 777 and 21 036 genes in human and mouse, respectively, by using the GeneChip data in NCBI GEO. COXPRESdb enables analysis of the four types of coexpression networks: (i) highly coexpressed genes for every gene, (ii) genes with the same GO annotation, (iii) genes expressed in the same tissue and (iv) user-defined gene sets. When the networks became too big for the static picture on the web in GO networks or in tissue networks, we used Google Maps API to visualize them interactively. COXPRESdb also provides a view to compare the human and mouse coexpression patterns to estimate the conservation between the two species.
In recent years, the maturation of microarray technology has allowed the genome-wide analysis of gene expression patterns to identify tissue-specific and ubiquitously expressed ('housekeeping') genes. We have performed a functional and topological analysis of housekeeping and tissue-specific networks to identify universally necessary biological processes, and those unique to or characteristic of particular tissues.
We measured whole genome expression in 31 human tissues, identifying 2374 housekeeping genes expressed in all tissues, and genes uniquely expressed in each tissue. Comprehensive functional analysis showed that the housekeeping set is substantially larger than previously thought, and is enriched with vital processes such as oxidative phosphorylation, ubiquitin-dependent proteolysis, translation and energy metabolism. Network topology of the housekeeping network was characterized by higher connectivity and shorter paths between the proteins than the global network. Ontology enrichment scoring and network topology of tissue-specific genes were consistent with each tissue's function and expression patterns clustered together in accordance with tissue origin. Tissue-specific genes were twice as likely as housekeeping genes to be drug targets, allowing the identification of tissue 'signature networks' that will facilitate the discovery of new therapeutic targets and biomarkers of tissue-targeted diseases.
A comprehensive functional analysis of housekeeping and tissue-specific genes showed that the biological function of housekeeping and tissue-specific genes was consistent with tissue origin. Network analysis revealed that tissue-specific networks have distinct network properties related to each tissue's function. Tissue 'signature networks' promise to be a rich source of targets and biomarkers for disease treatment and diagnosis.
Genetic robustness refers to a compensatory mechanism for buffering deleterious mutations or environmental variations. Gene duplication has been shown to provide such functional backups. However, the overall contribution of duplication-based buffering for genetic robustness is rather small. In this study, we investigated whether transcriptional compensation also exists among genes that share similar functions without sequence homology. A set of nonhomologous synthetic-lethal gene pairs was assessed by using a coexpression network, protein-protein interactions, and other types of genetic interactions in yeast. Our results are notably different from those of previous studies on buffering paralogs. The low expression similarity and the conditional coexpression alone do not play roles in identifying the functionally compensatory genes. Additional properties such as synthetic-lethal interaction, the ratio of shared common interacting partners, and the degree of coregulation were, at least in part, necessary to extract functional compensatory genes. Our network-based approach is applicable to select several well-documented cases of compensatory gene pairs and a set of new pairs. The results suggest that transcriptional reprogramming plays a limited role in functional compensation among nonhomologous genes. Our study aids in understanding the mechanism and features of functional compensation more in detail.
High-throughput gene expression data can predict gene function through the “guilt by association” principle: coexpressed genes are likely to be functionally associated.
We analyzed publicly available expression data on normal human tissues. The analysis is based on the integration of data obtained with two experimental platforms (microarrays and SAGE) and of various measures of dissimilarity between expression profiles. The building blocks of the procedure are the Ranked Coexpression Groups (RCG), small sets of tightly coexpressed genes which are analyzed in terms of functional annotation. Functionally characterized RCGs are selected by means of the majority rule and used to predict new functional annotations. Functionally characterized RCGs are enriched in groups of genes associated to similar phenotypes. We exploit this fact to find new candidate disease genes for many OMIM phenotypes of unknown molecular origin.
We predict new functional annotations for many human genes, showing that the integration of different data sets and coexpression measures significantly improves the scope of the results. Combining gene expression data, functional annotation and known phenotype-gene associations we provide candidate genes for several genetic diseases of unknown molecular basis.
Response of cells to changing environmental conditions is governed by the dynamics of intricate biomolecular interactions. It may be reasonable to assume, proteins being the dominant macromolecules that carry out routine cellular functions, that understanding the dynamics of protein∶protein interactions might yield useful insights into the cellular responses. The large-scale protein interaction data sets are, however, unable to capture the changes in the profile of protein∶protein interactions. In order to understand how these interactions change dynamically, we have constructed conditional protein linkages for Escherichia coli by integrating functional linkages and gene expression information. As a case study, we have chosen to analyze UV exposure in wild-type and SOS deficient E. coli at 20 minutes post irradiation. The conditional networks exhibit similar topological properties. Although the global topological properties of the networks are similar, many subtle local changes are observed, which are suggestive of the cellular response to the perturbations. Some such changes correspond to differences in the path lengths among the nodes of carbohydrate metabolism correlating with its loss in efficiency in the UV treated cells. Similarly, expression of hubs under unique conditions reflects the importance of these genes. Various centrality measures applied to the networks indicate increased importance for replication, repair, and other stress proteins for the cells under UV treatment, as anticipated. We thus propose a novel approach for studying an organism at the systems level by integrating genome-wide functional linkages and the gene expression data.
Many cellular processes and the response of cells to environmental cues are determined by the intricate protein∶protein interactions. These cellular protein interactions can be represented in the form of a graph, where the nodes represent the proteins and the edges signify the interactions between them. However, the available protein functional linkage maps do not incorporate the dynamics of gene expression and thus do not portray the dynamics of true protein∶protein interactions in vivo. We have used gene expression data as well as the available protein functional interaction information for Escherichia coli to build the protein interaction networks for expressed genes in a given condition. These networks, named conditional networks, capture the differences in the protein interaction networks and hence the cell physiology. Thus, by exploring the dynamics of protein interaction profiles, we hope to understand the response of cells to environmental changes.
Gene regulatory networks control the global gene expression and the dynamics of protein output in living cells. In multicellular organisms, transcription factors and microRNAs are the major families of gene regulators. Recent studies have suggested that these two kinds of regulators share similar regulatory logics and participate in cooperative activities in the gene regulatory network; however, their combinational regulatory effects and preferences on the protein interaction network remain unclear.
In this study, we constructed a global human gene regulatory network comprising both transcriptional and post-transcriptional regulatory relationships, and integrated the protein interactome into this network. We then screened the integrated network for four types of regulatory motifs: single-regulation, co-regulation, crosstalk, and independent, and investigated their topological properties in the protein interaction network.
Among the four types of network motifs, the crosstalk was found to have the most enriched protein-protein interactions in their downstream regulatory targets. The topological properties of these motifs also revealed that they target crucial proteins in the protein interaction network and may serve important roles of biological functions.
Altogether, these results reveal the combinatorial regulatory patterns of transcription factors and microRNAs on the protein interactome, and provide further evidence to suggest the connection between gene regulatory network and protein interaction network.
Exploring topological properties of human brain network has become an exciting topic in neuroscience research. Large-scale structural and functional brain networks both exhibit a small-world topology, which is evidence for global and local parallel information processing. Meanwhile, resting state networks (RSNs) underlying specific biological functions have provided insights into how intrinsic functional architecture influences cognitive and perceptual information processing. However, topological properties of single RSNs remain poorly understood. Here, we have two hypotheses: i) each RSN also has optimized small-world architecture; ii) topological properties of RSNs related to perceptual and higher cognitive processes are different. To test these hypotheses, we investigated the topological properties of the default-mode, dorsal attention, central-executive, somato-motor, visual and auditory networks derived from resting-state functional magnetic resonance imaging (fMRI). We found small-world topology in each RSN. Furthermore, small-world properties of cognitive networks were higher than those of perceptual networks. Our findings are the first to demonstrate a topological fractionation between perceptual and higher cognitive networks. Our approach may be useful for clinical research, especially for diseases that show selective abnormal connectivity in specific brain networks.
Differential coexpression analysis (DCEA) is increasingly used for investigating the global transcriptional mechanisms underlying phenotypic changes. Current DCEA methods mostly adopt a gene connectivity-based strategy to estimate differential coexpression, which is characterized by comparing the numbers of gene neighbors in different coexpression networks. Although it simplifies the calculation, this strategy mixes up the identities of different coexpression neighbors of a gene, and fails to differentiate significant differential coexpression changes from those trivial ones. Especially, the correlation-reversal is easily missed although it probably indicates remarkable biological significance.
We developed two link-based quantitative methods, DCp and DCe, to identify differentially coexpressed genes and gene pairs (links). Bearing the uniqueness of exploiting the quantitative coexpression change of each gene pair in the coexpression networks, both methods proved to be superior to currently popular methods in simulation studies. Re-mining of a publicly available type 2 diabetes (T2D) expression dataset from the perspective of differential coexpression analysis led to additional discoveries than those from differential expression analysis.
This work pointed out the critical weakness of current popular DCEA methods, and proposed two link-based DCEA algorithms that will make contribution to the development of DCEA and help extend it to a broader spectrum.
Human immunodeficiency virus-1 (HIV-1) in acquired immune deficiency syndrome (AIDS) relies on human host cell proteins in virtually every aspect of its life cycle. Knowledge of the set of interacting human and viral proteins would greatly contribute to our understanding of the mechanisms of infection and subsequently to the design of new therapeutic approaches. This work is the first attempt to predict the global set of interactions between HIV-1 and human host cellular proteins. We propose a supervised learning framework, where multiple information data sources are utilized, including co-occurrence of functional motifs and their interaction domains and protein classes, gene ontology annotations, posttranslational modifications, tissue distributions and gene expression profiles, topological properties of the human protein in the interaction network and the similarity of HIV-1 proteins to human proteins’ known binding partners. We trained and tested a Random Forest (RF) classifier with this extensive feature set. The model’s predictions achieved an average Mean Average Precision (MAP) score of 23%. Among the predicted interactions was for example the pair, HIV-1 protein tat and human vitamin D receptor. This interaction had recently been independently validated experimentally. The rank-ordered lists of predicted interacting pairs are a rich source for generating biological hypotheses. Amongst the novel predictions, transcription regulator activity, immune system process and macromolecular complex were the top most significant molecular function, process and cellular compartments, respectively. Supplementary material is available at URL www.cs.cmu.edu/~oznur/hiv/hivPPI.html
Information regarding gene coexpression is useful to predict gene function. Several databases have been constructed for gene coexpression in model organisms based on a large amount of publicly available gene expression data measured by GeneChip platforms. In these databases, Pearson's correlation coefficients (PCCs) of gene expression patterns are widely used as a measure of gene coexpression. Although the coexpression measure or GeneChip summarization method affects the performance of the gene coexpression database, previous studies for these calculation procedures were tested with only a small number of samples and a particular species. To evaluate the effectiveness of coexpression measures, assessments with large-scale microarray data are required. We first examined characteristics of PCC and found that the optimal PCC threshold to retrieve functionally related genes was affected by the method of gene expression database construction and the target gene function. In addition, we found that this problem could be overcome when we used correlation ranks instead of correlation values. This observation was evaluated by large-scale gene expression data for four species: Arabidopsis, human, mouse and rat.
gene coexpression; Pearson's correlation coefficient; GeneChip summarization; Arabidopsis
Recent large-scale studies of evolutionary changes in gene expression among mammalian species have led to the proposal that gene expression divergence may be neutral with respect to organismic fitness. Here, we employ a comparative analysis of mammalian gene sequence divergence and gene expression divergence to test the hypothesis that the evolution of gene expression is predominantly neutral. Two models of neutral gene expression evolution are considered: 1—purely neutral evolution (i.e., no selective constraint) of gene expression levels and patterns and 2—neutral evolution accompanied by selective constraint. With respect to purely neutral evolution, levels of change in gene expression between human–mouse orthologs are correlated with levels of gene sequence divergence that are determined largely by purifying selection. In contrast, evolutionary changes of tissue-specific gene expression profiles do not show such a correlation with sequence divergence. However, divergence of both gene expression levels and profiles are significantly lower for orthologous human–mouse gene pairs than for pairs of randomly chosen human and mouse genes. These data clearly point to the action of selective constraint on gene expression divergence and are inconsistent with the purely neutral model; however, there is likely to be a neutral component in evolution of gene expression, particularly, in tissues where the expression of a given gene is low and functionally irrelevant. The model of neutral evolution with selective constraint predicts a regular, clock-like accumulation of gene expression divergence. However, relative rate tests of the divergence among human–mouse–rat orthologous gene sets reveal clock-like evolution for gene sequence divergence, and to a lesser extent for gene expression level divergence, but not for the divergence of tissue-specific gene expression profiles. Taken together, these results indicate that gene expression divergence is subject to the effects of purifying selective constraint and suggest that it might also be substantially influenced by positive Darwinian selection.
Molecular evolution; Neutral theory; Human; Mouse; Genomics
High-throughput generation of new types of relational biomedical datasets is creating a demand for methods to provide insights into their complexity. Such networks are often too large to interpret visually and too complicated to be explained solely based on local topological properties.
One way to try to make sense of such complex networks would be to transform them into discernable abstracts, or summaries, of the original networks. Then, important components could become more readily visible. This work presents such an approach for understanding networks via abstraction of global network connectivity using compression. This made possible the discovery of a new type of topological class, referred to herein as a guild, that captures global connectivity similarity. Lastly, the correspondence of these guilds to biological function is validated via an E. Coli gene regulation network. This resulted in biological findings that could not be derived from local topology of the original network.
Gene coexpression relationships that are phylogenetically conserved between human and mouse have been shown to provide important clues about gene function that can be efficiently used to identify promising candidate genes for human hereditary disorders. In the past, such approaches have considered mostly generic gene expression profiles that cover multiple tissues and organs. The individual genes of multicellular organisms, however, can participate in different transcriptional programs, operating at scales as different as single-cell types, tissues, organs, body regions or the entire organism. Therefore, systematic analysis of tissue-specific coexpression could be, in principle, a very powerful strategy to dissect those functional relationships among genes that emerge only in particular tissues or organs. In this report, we show that, in fact, conserved coexpression as determined from tissue-specific and condition-specific data sets can predict many functional relationships that are not detected by analyzing heterogeneous microarray data sets. More importantly, we find that, when combined with disease networks, the simultaneous use of both generic (multi-tissue) and tissue-specific conserved coexpression allows a more efficient prediction of human disease genes than the use of generic conserved coexpression alone. Using this strategy, we were able to identify high-probability candidates for 238 orphan disease loci. We provide proof of concept that this combined use of generic and tissue-specific conserved coexpression can be very useful to prioritize the mutational candidates obtained from deep-sequencing projects, even in the case of genetic disorders as heterogeneous as XLMR.
disease-gene prediction; functional annotation; transcriptome; phenome
Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
Functionally related proteins interact in diverse ways to carry out biological processes, and each protein often participates in multiple pathways. Proteins are therefore organized into a complex network through which different functions of the cell are carried out. An accurate description of such a network is invaluable to our understanding of both the system-level features of a cell and those of an individual biological process. In this study, we used a probabilistic model to combine information from diverse genome-scale studies as well as individual investigations to generate a global functional network for mouse. Our analysis of the global topology of this network reveals biologically relevant systems-level characteristics of the mouse proteome, including conservation of functional neighborhoods and network features characteristic of known disease genes and key transcriptional regulators. We have made this network publicly available for search and dynamic exploration by researchers in the community. Our Web interface enables users to easily generate hypotheses regarding potential functional roles of uncharacterized proteins, investigate possible links between their proteins of interest and disease, and identify new players in specific biological processes.
Biological networks are a topic of great current interest, particularly with the publication of a number of large genome-wide interaction datasets. They are globally characterized by a variety of graph-theoretic statistics, such as the degree distribution, clustering coefficient, characteristic path length and diameter. Moreover, real protein networks are quite complex and can often be divided into many sub-networks through systematic selection of different nodes and edges. For instance, proteins can be sub-divided by expression level, length, amino-acid composition, solubility, secondary structure and function. A challenging research question is to compare the topologies of sub- networks, looking for global differences associated with different types of proteins. TopNet is an automated web tool designed to address this question, calculating and comparing topological characteristics for different sub-networks derived from any given protein network. It provides reasonable solutions to the calculation of network statistics for sub-networks embedded within a larger network and gives simplified views of a sub-network of interest, allowing one to navigate through it. After constructing TopNet, we applied it to the interaction networks and protein classes currently available for yeast. We were able to find a number of potential biological correlations. In particular, we found that soluble proteins had more interactions than membrane proteins. Moreover, amongst soluble proteins, those that were highly expressed, had many polar amino acids, and had many alpha helices, tended to have the most interaction partners. Interestingly, TopNet also turned up some systematic biases in the current yeast interaction network: on average, proteins with a known functional classification had many more interaction partners than those without. This phenomenon may reflect the incompleteness of the experimentally determined yeast interaction network.
Even in the post-genomic era, the identification of candidate genes within loci associated with human genetic diseases is a very demanding task, because the critical region may typically contain hundreds of positional candidates. Since genes implicated in similar phenotypes tend to share very similar expression profiles, high throughput gene expression data may represent a very important resource to identify the best candidates for sequencing. However, so far, gene coexpression has not been used very successfully to prioritize positional candidates.
We show that it is possible to reliably identify disease-relevant relationships among genes from massive microarray datasets by concentrating only on genes sharing similar expression profiles in both human and mouse. Moreover, we show systematically that the integration of human-mouse conserved coexpression with a phenotype similarity map allows the efficient identification of disease genes in large genomic regions. Finally, using this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.
Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes.
One of the most limiting aspects of biological research in the post-genomic era is the capability to integrate massive datasets on gene structure and function for producing useful biological knowledge. In this report we have applied an integrative approach to address the problem of identifying likely candidate genes within loci associated with human genetic diseases. Despite the recent progress in sequencing technologies, approaching this problem from an experimental perspective still represents a very demanding task, because the critical region may typically contain hundreds of positional candidates. We found that by concentrating only on genes sharing similar expression profiles in both human and mouse, massive microarray datasets can be used to reliably identify disease-relevant relationships among genes. Moreover, we found that integrating the coexpression criterion with systematic phenome analysis allows efficient identification of disease genes in large genomic regions. Using this approach on 850 OMIM loci characterized by unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.
Similarity in gene expression pattern between closely linked genes is known in several eukaryotes. Two models have been proposed to explain the presence of such coexpression patterns. The adaptive model assumes that coexpression is advantageous and is established by relocation of initially unlinked but coexpressed genes, whereas the neutral model asserts that coexpression is a type of leaky expression due to similar expressional environments of linked genes, but is neither advantageous nor detrimental. However, these models are incompatible with several empirical observations. Here, we propose that coexpression of linked genes is a form of transcriptional interference that is disadvantageous to the organism. We show that even distantly linked genes that are tens of megabases away exhibit significant coexpression in the human genome. However, the linkage is more likely to be broken during evolution between genes of high coexpression than those of low coexpression and the breakage of linkage reduces gene coexpression. These results support our hypothesis that coexpression of linked genes in mammalian genomes is generally disadvantageous, implying that many mammalian genes may never reach their optimal expression pattern due to the interference of their genomic environment and that such transcriptional interference may be a force promoting recurrent relocation of genes in the genome.
gene order; linkage; gene expression; coexpression; evolution; mammals
The recent explosion in biological and other real-world network data has created the need for improved tools for large network analyses. In addition to well established global network properties, several new mathematical techniques for analyzing local structural properties of large networks have been developed. Small over-represented subgraphs, called network motifs, have been introduced to identify simple building blocks of complex networks. Small induced subgraphs, called graphlets, have been used to develop "network signatures" that summarize network topologies. Based on these network signatures, two new highly sensitive measures of network local structural similarities were designed: the relative graphlet frequency distance (RGF-distance) and the graphlet degree distribution agreement (GDD-agreement).
Finding adequate null-models for biological networks is important in many research domains. Network properties are used to assess the fit of network models to the data. Various network models have been proposed. To date, there does not exist a software tool that measures the above mentioned local network properties. Moreover, none of the existing tools compare real-world networks against a series of network models with respect to these local as well as a multitude of global network properties.
Thus, we introduce GraphCrunch, a software tool that finds well-fitting network models by comparing large real-world networks against random graph models according to various network structural similarity measures. It has unique capabilities of finding computationally expensive RGF-distance and GDD-agreement measures. In addition, it computes several standard global network measures and thus supports the largest variety of network measures thus far. Also, it is the first software tool that compares real-world networks against a series of network models and that has built-in parallel computing capabilities allowing for a user specified list of machines on which to perform compute intensive searches for local network properties. Furthermore, GraphCrunch is easily extendible to include additional network measures and models.
GraphCrunch is a software tool that implements the latest research on biological network models and properties: it compares real-world networks against a series of random graph models with respect to a multitude of local and global network properties. We present GraphCrunch as a comprehensive, parallelizable, and easily extendible software tool for analyzing and modeling large biological networks. The software is open-source and freely available at . It runs under Linux, MacOS, and Windows Cygwin. In addition, it has an easy to use on-line web user interface that is available from the above web page.