|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Genes and proteins are organized into functional modular networks in which the network context of a gene or protein has implications for cellular function. Highly connected hub proteins, largely responsible for maintaining network connectivity, have been found to be much more likely to be essential for yeast survival.
Here we investigate the properties of weighted gene co-expression networks formed from multiple microarray datasets. The constructed networks approximate scale-free topology, but this is not universal across all datasets. We show strong positive correlations between gene connectivity within the whole network and gene essentiality as well as gene sequence conservation. We demonstrate the preservation of a modular structure of the networks formed, and demonstrate that, within some of these modules, it is possible to observe a strong correlation between connectivity and essentiality or between connectivity and conservation within the modules particularly within modules containing larger numbers of essential genes.
Application of these techniques can allow a finer scale prediction of relative gene importance for a particular process within a group of similarly expressed genes.
Genes and their protein products carry out cellular processes in the context of functional modules and are related to each other through a complex network of interactions . Understanding an individual gene or protein's network properties within such networks may prove to be as important as understanding its function in isolation . Because of this, numerous studies have focused on the large scale modeling of genomic and proteomic data. Utilizing network theory, these studies have yielded insights into biological systems. For example, both protein interaction networks and gene co-expression networks exhibit a strong modularity reflecting functional partitioning. Both of these network types have been frequently observed as having a scale-free topology, with the existence of highly connected hub nodes [3-7].
Scale-free networks are resistant to random perturbations but sensitive to targeted removal of highly connected nodes . Comprehensive efforts to determine the functional consequences of individual gene deletions in yeast provide the opportunity to study the relationship between individual gene network properties and gene deletion lethality . For example, physical interaction studies in yeast have allowed comparison of connectivity to gene essentiality based on gene deletion. . Typical of scale-free networks, there were few highly connected proteins within the network, and the deletion of a protein with a large number of binding partners is more likely to be lethal in yeast. Thus, the relative position of nodes within a protein interaction network is strongly affiliated with distinct biological properties of individual proteins. Similarly, analysis of unweighted gene co-expression networks have revealed a relationship between connectivity and essentiality across all genes .
Correlation of gene expression across a wide variety of experimental perturbations has been shown to cluster genes of similar function . Since this guilt-by-association approach may lead to false positive groupings, approaches have been refined by comparing orthologs across divergent species indicating that highly conserved co-expression is a strong predictor that two genes will function in similar pathways [13-15]. This indicates that functionally related genes are under similar expression constraints. The gene co-expression networks that are based on these relationships have been shown in multiple species to be frequently scale-free and exhibit a small world architecture similar to protein interaction networks even though they are generally more strongly connected [3,14]. It is still unclear however, to what degree the network properties of individual genes within a co-expression network can predict relative gene importance for a particular process. To assess this, we have constructed three networks based on a weighted measure of connectivity  of correlated gene expression in yeast using three separate microarray data sets. We assessed relationships between essentiality and connectivity of each gene within the whole network. Further, we define 'modules' (groups of highly correlated genes) and determine that in some instances, the relative importance of genes within these modules can be inferred from network connectivity. We demonstrate that genes which have high connectivity (i.e. 'hub' genes) within a weighted co-expression network are significantly more likely to be essential for yeast viability. Furthermore, we demonstrate a relationship between connectivity and a measure of sequence conservation. Finally, we show that certain critical modules are conserved from one network to another, and that in many cases it is possible to extend the relationship between connectivity and essentiality or between connectivity and conservation within given modules. Thus, analysis of gene co-expression networks provides insight into the functional importance of individual genes within modules of co-expressed genes.
In order to test the relationship between gene connectivity within the co-expression network and gene essentiality, we first constructed three networks corresponding to three microarray datasets. The yeast microarray data were derived from experiments designed to analyze DNA Damage (DD) (n = 51), Cell Cycle (CC) (n = 44), and Environmental Response (ER) (n = 151) [17-19]. Each network was constructed as conceptually shown in Figure Figure1.1. Nodes were defined as individual genes. The co-expression network is based upon the correlation relationships between individual genes across a dataset (Figure (Figure1A).1A). From the microarray data, the absolute value of the Pearson correlation is the initial measure of gene co-expression similarity (Figure (Figure1B).1B). This co-expression similarity was then transformed into an adjacency matrix (Figure (Figure1C)1C) by raising the similarity to a certain power based on a scale free topology criterion as described previously . The powers selected were 12 (DD), 10 (CC), and 18 (ER). These data can also be visualized as a drawn network (Figure (Figure1D1D).
Each of the datasets resulted in the construction of distinct networks but with global similarities. The differences in the modularity of the three networks is demonstrated by comparing the heat maps of the topological overlap matrices (Figures (Figures3A,3A, ,4A,4A, and and5A).5A). A topological overlap matrix is an indicator of the similarity of all the nodes in question and is defined in detail in the methods section of this manuscript. Approximations of all three of the networks are also drawn in Figures Figures3B,3B, ,4B4B and and5B5B using the Fruchterman-Reingold algorithm as implemented in Pajek [20,21] and demonstrate obvious modularity with aggregation of genes into strongly co-expressed groups.
Individual gene connections can be difficult to visualize in such complex graphs as presented. Thus, in order to probe the structure of the network, we first sought to determine characteristics of the general network topology. We considered two models. To quantify the fit to the scale free topology model or to the truncated exponential model, we used the model fitting index R-squared of the corresponding linear model involving the log transformed variables . The pattern of network connectivity in these datasets approximated a scale-free topology (Figures 2A, 2B, and and2C).2C). These results are similar to the general trend with genetic perturbation networks and other gene co-expression networks [3,22]. However, gene co-expression networks may better fit an exponential power law . For two of the three constructed yeast networks (DD and ER), the distribution of connectivity (k) of genes best fits a power law p(k) ~ k-γ and the other (CC) best fits an exponentially truncated power law p(k) ~ k-γ e-αk . We point out that in general that regardless of the overall structure, the major trend is that the high connectivity genes are few in number, and that most genes have low connectivity (Figure 2A–2C).
Network modeling in biological systems [2,8,11] stresses the relative importance of highly connected nodes. Thus, the small number of high connectivity genes within the gene co-expression networks defined here are expected to be more important. To test the relative importance of each gene across the entire network, we determined the relationship between connectivity and gene essentiality. All the genes for a network were first ranked according to their weighted connectivity and then placed into 20 equal sized bins according to their position in this ranking. The weighted connectivity of a given gene is defined as the sum of its connection strengths with all other genes in the network (see the methods section). The percentage of essential genes in each bin was then plotted against the mean connectivity in each bin (Figure 2D–2F). We also calculated the Spearman correlation coefficient based on the ranked unbinned data and found that there is a relationship between the connectivity and the proportion of essential genes in each of the three datasets (R > 0.2, and p value < 10-39) for each network. While the binned plots are well suited for visualizing the results, we calculated the correlation coefficients on the raw, unbinned data to arrive at a valid p-value.
To study whether our conclusions were being unduly influenced by data outliers, we removed genes whose connectivity was more than two standard deviations away from mean connectivity but found no substantial variation in the reported relationships. Thus, there is a clear relationship between connectivity of genes, in these weighted gene co-expression networks, and gene essentiality across the whole network. These results are robust with respect to experimental conditions since the three separate datasets all have the same functional relationship between connectivity and gene essentiality. These results are consistent with a single whole network generated from gene co-expression data and using an unweighted network , as well as networks based on protein-protein interactions .
Yeast co-expression networks fit an evolutionarily plausible model in which network connection distribution is a natural consequence of gene duplication, mutation and deletions . The emergence of approximate power-law distribution (scale free topology) is intimately linked to the growth of the network in which new nodes are preferentially attached to highly connected already established nodes [2,8]. A direct consequence of this network growth model is that high connectivity 'hub' genes are more likely to be highly conserved across different species. Stated another way, within a gene co-expression network the oldest nodes would be more likely than the newer nodes to be in hub positions. With this in mind, we examined the relationship between connectivity and the relative evolutionary sequence conservation of a node (as evidenced by the average best blastp hit log(e score) against N. crassa C. elegans, D. melanogaster, H. sapiens, and M. musculus). High connectivity genes are plotted in Figure 2G–2I relative to the average of the blastp hit in four other eukaryotic genomes. As before, we calculated the Spearman correlation for the ranked, but still unbinned, data and found that there is a relatively weak but highly significant correlation indicating that high connectivity genes are more evolutionarily conserved (R = -0.15, p value = 1.3 × 10-19, for the DD network; R = -0.22, p value < 1.4 × 10-39 for ER network; and R = -0.18, p value = 4.0 × 10-29 for the CC network). Our results were robust with respect to the removal of outlier genes (i.e. genes whose connectivity was more than two standard deviations away from mean connectivity). This relationship is similar to that reported for gene perturbation networks  and in evolutionarily conserved gene co-expression relationships .
In order to identify modules, we performed hierarchical clustering on the topological overlap matrix derived from these networks as described in the methods section. In doing so, we identified modules that involve conserved processes for rRNA processing, protein synthesis, and ubiquitination (Table (Table1)1) using an EASE analysis . One prediction about network constructions is that certain critically important modules ought to be repeatedly detected in independent datasets. Within the network modules (groups of high connectivity genes within the network) produced by the different datasets for gene ontology enrichment, 3 distinct modules that were common to the 3 different data sets were detected (Table (Table11 and Figure Figure6).6). Comparing the modules that had an enrichment in rRNA processing, protein synthesis and the ubiquitin pathway terms across the DD, CC and ER datasets resulted in a highly significant overlap of 215 genes (hypergeometric distribution p = 3.1e-173). Similarly, the protein synthesis module also had significantly more overlap than expected by chance in which 172 genes were present in all three modules (p = 5.6e-99). The ubiquitin module had the least overlap of common genes (12 genes, p = 3.4e-07) probably because of its smaller size in general.
These three processes are well known to be fundamental processes and have been described before in great detail, and so we decided to investigate the nature of these overlapping regions further. Gene members were selected that were shared within similar functional modules from the three datasets and searched for in the KEGG  database. In the ubiquitin pathway, 9 of the overlapping 12 genes were found in KEGG, and all of these were found to be in the same proteasome structure. Since the overlap for this module was unusually small due in part to the very small size of the ER dataset's ubiquitin module, we also looked at the overlap just of the CC and DD ubiquitin modules. Of the 57 genes present in both the CC and DD modules, only 14 were present in the KEGG database. Of those 14 members, all of them were in the same proteasome structure. For the Protein Biosynthesis module, 135 of 172 overlapping members were represented in the KEGG database, and of these, 72 out of a possible 86 were components of the KEGG annotated yeast ribosome. Notably, if the 172 overlapping members were first rank ordered by their average network connectivity, and the bottom quartile removed, then 100% of the remaining members are part of this same yeast ribosome structure, which shows that in this overlap set the high connectivity genes were in fact the members of the yeast ribosome. Finally, the overlap of genes detected in the rRNA modules overlap identified purine and pyrimidine synthesis and RNA polymerase 3 pathways. These pathways dominate this list with 30 out of the 49 identified matches to the KEGG database. Unlike the previous two results however, multiple pathways are found that represent this module. The three network modules discussed all contain large protein complexes. This is in agreement with the recent discovery that there is often a correlation between genes that are expressed together and proteins that bind to each other . However, there is no reason to believe that modules in co-expression networks will always correspond to protein complexes since our module detection method should find any set of genes with high pair-wise correlations. Any experimental perturbation that gives rise to a highly correlated gene set is expected to result in a corresponding network module.
Having identified 3 functionally similar modules in the three different datasets, we assessed whether these same modules contain the same information as the entire network. Is the connectivity of a gene within the identified modules correlated with the likelihood that a gene was essential by deletion analysis and also whether it was conserved (Figures (Figures3,3, ,4,4, ,5)?5)? To test connectivity correlations within these modules, we define the intramodular connectivity (k.in) as the sum of all connectivity measures only with genes within the same module. Similar to that performed for the total gene connectivity within the whole network, binned values were plotted to observe the possible relationships (Figures 3C–H, 4C–H, and 5C–H). To evaluate significance, Spearman correlation coefficients for the ranked but unbinned data were calculated for relationship of k.in to essentiality and sequence conservation. Interestingly, there was a general trend to similar correlations as that seen within the whole network. The correlation with sequence conservation was generally weak: The ubiquitin module formed from the DD dataset and the rRNA processing modules in the ER and CC were the most correlated with k.in. There was a correlation of k.in and essentiality in all three datasets within the rRNA Processing module (Figures (Figures3C,3C, ,4C4C and and5C),5C), and for two of the datasets (DD and ER but not CC) within the ubiquitin module, and a trend towards a weak anticorrelation within the protein synthesis module (Figures (Figures3D,3D, ,4D,4D, and and5D).5D). Thus, some of the relationships between connectivity and essentiality or sequence conservation and not detected when the analysis is restricted within the modules.
The relationship between intramodular connectivity and essentiality is not necessarily universal for any module in any dataset. Whether or not a module contains meaningful information for predicting essentiality depends on the expression data (biological perturbations) that were analyzed. One can view the correlation of k.in and essentiality as a test of the modules importance. The key issue is that in order for a module to demonstrate this relationship, there must be a sufficient number of essential genes within the module. As some modules have substantially fewer essential genes than others, there is little power to detect any relationship with k.in within those modules. We calculated the Spearman correlation between the proportion of essential genes within a module and the correlation of k.in with essentiality and show that there is a strong positive correlation (r = 0.48). Since in actual applications the goal will be to find unknown 'essential' genes, it would be preferable to have another metric of 'essential' gene enrichment. In this regard, it is interesting to note that of the three modules studied in the 3 datasets (9 total), there were 4 that had a correlation of k.in with sequence conservation (p < 0.05). Of these 4 modules, all 4 were shown to have a strong relationship of k.in with essentiality in the same modules. Of the 5 without a correlation (p > 0.05), only one showed a correlation between k.in and essentiality (DD, ubiquitin module). Thus, the detection of a relationship between intramodular connectivity and sequence conservation may predict those modules with a relationship between k.in and essentiality.
Our results extend the inferences made from protein and metabolic networks to the more prolific datasets available from gene expression data. These results are made possible due to broad surveys of expression and the comprehensive characterization of the yeast genome with deletion mutants. In contrast to gene expression data, which are based on the assessment of cells in many different states of perturbation, protein interaction measurements are determined in only a limited set of conditions which by nature only capture a subset of all possible protein-protein interactions. Others [22,28,29] have constructed gene perturbation networks by looking at the expression changes when individual genes are eliminated or in relation to transcription factor expression. Microarray based surveys of gene expression allow detailed surveys of gene expression across a wide variety of experimental perturbations. From these data a strong correlation of gene expression for two genes implies that both genes act within a common functional group and are under similar transcriptional control. Since these networks are not based on direct measurements of protein binding/interaction, the presence of two genes that are highly correlated has a different interpretation than that derived from a protein interaction network. In a co-expression network, two closely tied nodes would be highly correlated but not necessarily direct binding partners. In spite of this key difference, numerous other studies have indicated that genes that are correlated with each other across a modest range of experiments tend to function in similar ways and that this functional categorization is strengthened when gene co-expression exists across large evolutionary distances [12,14].
Here, we use a more general weighted gene co-expression network construction method for network construction that can be applied generically to microarray data using simple correlational analyses of individual genes . Advantages of weighted networks over unweighted networks based on dichotomizing the correlation matrix include a) that the continuous nature of the gene co-expression information is preserved and b) that the results of weighted network analyses are highly robust with respect to the choice of the parameter β, . We show that there is a strong correlation with the weighted gene co-expression connectivity measure and gene essentiality and sequence conservation. Our analysis demonstrates that individual conserved modules are identifiable within a network. Moreover, the presence of these modules, which were detected in three independent networks, illustrates that these networks have a structure to them which matches known cellular biology. We have also shown that when sufficient information is available, these correlation modules can provide further information about which of their members are more likely to be important for a given process. The ability to assess whether or not a module is complete enough to be informative using a test like either of the ones employed here will be important for validating the merit of less understood modules in the future.
Our study extends the strong links between the network properties of a gene within co-expression networks and the global importance of each gene to the yeast cell, and may have implications for targeted drug therapy in more complex biological systems like cancer. For instance, one can infer from an analysis of the single gene deletion data that single targets for inhibition can be partially predicted from expression data. However, in biological systems an optimal effect is frequently achieved by inhibiting multiple specific targets. Can we extrapolate network positions and relationships to infer effective combinations of targeted therapeutics? Yeast is an excellent model for establishing the rules for applying network theories to practical predictions. For instance, will double deletions of high connectivity genes within the network which reside in the same module have substantially greater negative consequences to the yeast than double deletions of low connectivity genes? Alternatively, will genes which are connecting distinct modules, and serving as key connections between integrators prove to be better disruption targets? The modeling of such proposed interactions in yeast is attractive because of the relative tractability of yeast studies and the availability of large scale datasets. Technologies like synthetic genetic arrays which automate double deletion screens or chemical genomic profiling provide an expanding capability to survey the effects of multiple target inhibition or deletion [30-32]. However, even with these powerful methods, only a small fraction of the millions of possible combinations will be assayed. For instance in a substantial recent survey of double deletions only 0.02% of all possible double deletions were tested for synthetic lethality . Thus, network based modeling may be a useful adjunct for the design of double knockout experiments or combination inhibition experiments to permit a more efficient search for synthetic lethals. An interplay between predictive models and targeted large scale inhibition studies may be quite useful for surveying pathway analysis in yeast. In the long term, such studies may also be critical in the development of therapies for diseases like cancer.
In conclusion, our study demonstrates that weighted gene co-expression networks can be used to predict both the likelihood of essentiality and the likelihood of conservation for their gene members within the entire network. Furthermore, we show that these networks can be subdivided into identifiable modules, which themselves can often contain enough individual information to indicate which genes are more likely to be crucial for a particular process. We have indicated a means for testing whether this is true or not. Thus, we have illustrated how this process can be applied to common microarray data for the purpose of predicting both functional assignment as well as the ranking of gene significance in a particular process with which it is found to be associated. We believe that these kinds of predictions will be helpful for identifying gene targets for therapeutics and for directing multi-drug therapies.
Yeast microarray data were obtained from the Saccharomyces Genome Database in ratio format . In each case, microarray based gene expression analysis was previously reported to have been performed for the analysis of DNA Damage , Cell Cycle , and Environmental Response . Not all of the data obtained from each datasets listed was used in these experiments. Because missing data is especially confounding for our analysis, arrays where greater than 5% of the gene expression information was missing were removed from our analysis. The exact number of arrays chosen to represent each dataset was: DNA Damage (n = 51), Cell Cycle (n = 44), and Environmental Response (n = 151). Essential genes were determined from the Saccharomyces Genome Deletion Project . Genomic data for conducting the blastp searches were obtained from the following locations: Neurospora genome data (version 7) was obtained from the Broad Institute [36,37]. Drosophila genome data was obtained from the Berkeley Drosophila genome project , C. elegans genome data was obtained from the wormbase database , while both the human and mouse genomes were obtained from the UCSC genome browser . The blastp algorithm was run locally.
In gene co-expression networks, each gene corresponds to a node. The neighbors of a node i are the nodes that are connected to the node i. Two genes are connected by an edge with a weight indicating the connection strength. A gene co-expression network can be represented by an adjacency matrix A = [aij], where aij is the weight of a connection between two nodes i and j. For all of the networks considered in this paper, the connectivity equals the sum of connection weights being considered.
Our analysis follows a general framework for construction of gene co-expression networks . To transform the co-expression measure (here the absolute value of the Pearson correlation matrix) into measures of pair wise connection strengths, one can make use of an adjacency function. The choice of the adjacency function determines whether the resulting network will be weighted (soft thresholding) or unweighted (hard thresholding). A widely used adjacency function is the signum function which implements 'hard' thresholding involving the threshold parameter tau. Specifically, aij = Ind(|cor(xi, xj)| > τ where the indicator function takes on the value 1 if the condition is satisfied and 0 otherwise. Hard thresholding using the signum function leads to intuitive network concepts (e.g., the node connectivity equals the number of direct neighbors), but it may lead to a loss of information. For instance, if the threshold has been set to 0.8, there will be no connection between two genes if their correlation equals 0.79. To avoid the disadvantages of hard thresholding, Zhang and Horvath  proposed a 'soft' power adjacency function: aij = |cor(xi, xj)|β with the single parameter beta. The power adjacency function preserves the continuous nature of the gene co-expression information and, leads to results that are highly robust with respect to the choice of beta.
Zhang and Horvath  also proposed a criterion for choosing the power beta of the adjacency function which is based on the fact that despite significant variation in their individual constituents and pathways, metabolic networks have been found to display approximate scale free topology. The linear regression model fitting index R^2 can be used to quantify how well a network satisfies a scale-free topology. While the scale free model has only 1 parameter (γ), the truncated exponential model allows for 2 parameters γ and α. Empirically, we find that the two parameters of the truncated exponential model provide too much flexibility in curve fitting so that the R-squared values are always high . Thus, we focus on the scale free topology model fitting index in this application and emphasize the major trend that the high connectivity genes are few in number, and that most genes have low connectivity (Figure 2A–2C). There is a natural trade-off between maximizing scale-free topology model fit (R^2) and maintaining a high mean number of connections: parameter values that lead to an R^2 value close to 1 may lead to networks with very few connections. These considerations motivated the scale-free topology criterion. To choose the parameters of an adjacency function: Only those parameter values that lead to a network satisfying scale-free topology at least approximately were considered (e.g. signed R^2 > 0.80). When considering the signum and power adjacency functions, we find the relationship between R^2 and the adjacency function parameter (tau or beta) is characterized by a saturation curve. In our applications, we use the first parameter value where saturation is reached as long as it is above 0.8. An R^2 > .80 results in networks that may only approximate the scale free property as is highlighted by the Cell Cycle network in this manuscript.
For each dataset, we first selected the top 4000 most varying transcripts based on the standard deviation. The network adjacency between 2 genes i and j was defined as a power of the correlation coefficient between the corresponding gene expression profiles xi and xj, i.e. aij = |cor(xi, xj)|β where the power beta is chosen using the scale free topology criterion. From this function, the 3000 most-connected genes were selected for further detailed network analysis. Using the 4000 most varying transcripts was not very restrictive since the original data are derived from yeast arrays and therefore only contained only about 6000 genes in total. Many of these genes needed to be removed due to high rates (>40%) of missing data for a gene across a data set. By definition, module genes are highly connected with the genes of their module (i.e. module genes tend have relatively high connectivity). Thus, for the purpose of module detection, restricting the analysis to the most connected genes does not lead to major information loss for the key points of this presentation. However, there may be applications where genes with relatively low connectivity are biologically interesting so that gene filtering based on connectivity would lead to information loss. The adjacency matrix was also visualized as a graph and the number of topological properties of this graph examined. Important network derived concepts include: hub (a node with many strong connections to other nodes), topological overlap (the interconnectivity between two nodes), modules (tightly connected subsets of the network), connectivity (the sum of the adjacencies, ie. ) and intramodular connectivity (the sum of the adjacencies within a module). One potential drawback of soft thresholding is that the network cannot be visualized directly because the number of relationships to display becomes intractable. A soft adjacency matrix specifies pair wise connection strengths and all nodes are connected. To visualize the network with Pajek, one needs to threshold the connection strengths. The resulting dichotomized adjacency matrix can be considered as an unweighted network approximation of the weighted network. Therefore, to make a visual representation, only the strongest correlations (0.9 or greater) were drawn in these renderings. Subsequent to this, networks were then visualized with Pajek [20,41].
Topological overlap of two nodes (genes) reflects their relative interconnectivity. For a network represented by an adjacency matrix A = [aij], aij [0,1], a well-known formula for defining topological overlap was given by , and it can be extended to weighted networks:
where, denotes the number of nodes to which both i and j are connected, and k is the number of connections of a node, with and . Since aij [0,1], we find that . Thus, . Since 1 - aij ≥ 0, we find that ωij is a number between 0 and 1. There are 2 reasons for adding 1 - aij to the denominator in the topological overlap matrix: 1) in this form, the denominator can never be 0 and 2) for an unweighted network, one can show that ωij = 1 only if the node with fewer links satisfies two conditions: (a) all of its neighbors are also neighbors of the other node, i.e. it is connected to all of the neighbors of the other node and (b) it is linked to the other node.
In contrast, ωij = 0 if i and j are unlinked and the two nodes don't have common neighbors. Further, the topological overlap matrix is symmetric, i.e, ωij = ωji. and its diagonal elements are set to 0 (i.e. ωii = 0). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors .
It is important to note that the definition of topological overlap matrix (equation 1) was taken from the methods supplement of reference  since there is a typo in the primary text. Different definitions of the topological overlap matrix are possible, but we have found empirically that many of these possible generalizations lead to very similar modules .
Modules are groups of nodes that have high topological overlap. Module identification is based on the topological overlap matrix Ω = [ωij defined as (1). To use it in hierarchical clustering, it is turned into a dissimilarity measure by using the standard approach of subtracting it from one (i.e, the topological overlap based dissimilarity measure is defined by = 1 - ij).
In networks involving few nodes, modules can easily be identified by inspecting the network, but for large networks involving hundred of nodes, it is useful to generate a topological overlap matrix (TOM) plot, which we review in the following. A TOM plot provides a "reduced" view of the network that allows one to visualize and identify network modules. The TOM plot is a color-coded depiction of the values of , described above.
Based on the TOM based dissimilarity matrix we can use hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in hierarchical clustering dendrogram . The algorithm makes use of the internal structure in a dendrogram and clips branches of the dendrogram. The algorithm is based on an adaptive process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal and future research should compare different method for cluster identification. Each module represents a group of genes with similar expression profiles across the samples and the expression profile pattern is distinct from those of the other modules. In each case, the modules were then assessed for enrichment in ontology terms based on Fisher's exact test as described in . These modules and their relative connectivities were then separated out for further analysis.
Based on the network modules, we can define the intramodular connectivity (k.in) for each node i (i.e., the number of neighbors in the same module as node i). The connectivity of node i across the whole network is denoted as k.all.
The translated sequences of the entire yeast genome were each blasted against the entire translated sequences of the Human, Mouse, Fly, Neurospora and C. elegans genomes. In each case, the best hit was determined using blastp, and the log(e score) scores were averaged across the 5 different comparisons in a manner similar to what was published by . These average blastp scores were then averaged for each point in (Figure 2H–2K, 3F–3H, 4F–4H, 5F–5H). This method is not necessarily an optimal measure of gene conservation, and it is used here simply as an approximate measure of gene conservation, that is independent of annotation. Future research should study the advantages and disadvantages of using different measures for gene conservation in network analysis.
We used a hyper-geometric p-value to evaluate the significance of the overlap of the module genes across the 3 data sets. Suppose N1, N2 N3 out of N genes are in rRNA processing module of the DNA Damage data, the Environmental Response data, and the Cell Cycle data, respectively. We then denote by 'overlap' the number of genes in the rRNA module that overlap in the 3 data sets.
To arrive at an upper limit of the p-value, we first computed pair wise p-values as follows. Denote as p12 the p-value that by randomly selecting N2 genes from a pool of N genes with N1 module 1 genes, ≥ overlap selected genes are in the rRNA-processing module in both data sets. This p-value can easily be computed using the hypergeometric distribution (dhyper function in R). N1 and N2 were then swapped to obtain a second p-value, which we denoted as p21. The maximum of those two values can be used to assess the significance of overlap between data sets 1 and 2. Similarly, we computed a p-value of overlap between data sets 1 and 3, and data sets 2 and 3. The minimum of the 3 resulting p-values provided an upper limit for assessing the significance of overlap between all 3 data sets, i.e. the final p-value was computed as p = min(max(p12,p21), max(p13,p31), max(p23,p32))
Prepared Data and R code are available online .
CC, cell cycle; DD, DNA damage; ER, environmental response
MC and SFN conceived of and developed the project, and wrote the final manuscript. MC analyzed the data, participated in the construction of all the software tools used in this study and drafted the manuscript and figures. SH, BZ authored the majority of the R code needed for developing these weighted networks, developing network measures, and providing crucial statistical guidance for this project. ZF helped with the statistical analysis. PM and SFN both contributed key ideas and suggestions as the project progressed. All authors participated in the development of the ideas presented in this manuscript. All authors read and approved the final manuscript.
The authors extend their thanks to the SGD for providing data for this analysis. Also, special mention should be made of Allen Day, Ally Perlina and Andy Yip for their input during the analysis of this project. The work was supported by a grant from the NCI (U01 CA881273) as well as the Henry Singleton Brain Tumor Program, the Elliott Family Foundation in memory of Neal Elliott, the Ziering Family Foundation in memory of Sigi Ziering, Accelerate Brain Cancer Cure, and Art of the Brain. MRJC was supported by the UCLA Tumor Cell Biology Training Program, funded by NIH/National Cancer Institute 5T32CA09056. This work was also supported by the UCLA DNA Microarray Facility and the NINDS/NIMH Microarray Consortium.