Mammalian coexpression networks
Tissue-specific expression profiles of human-mouse orthologous gene pairs were compared in order to evaluate the divergence of mammalian gene expression patterns. A total of 9,105 orthologous gene pairs were considered with respect to their expression levels across 28 tissues shared between the two species. All-against-all gene expression profile comparisons for the human and mouse matrices (9,105 × 28) were used to generate species-specific coexpression networks (Figure ). For coexpression networks, nodes correspond to genes, and edges link two genes from the same organism if their expression profiles are considered sufficiently similar (Figure ). A number of different metrics were used to measure the similarity (distance) between vectors of tissue-specific expression levels: Euclidean distance, Manhattan distance, Jensen-Shannon divergence, dot-product, cosine similarity and Pearson correlation coefficient. Results reported here are for networks constructed using the Pearson correlation coefficient (PCC
). The PCC
is widely employed for comparison of gene expression profiles and reflects similarity between expression patterns in terms of the relative expression levels across tissues. It should be noted that the results for the coexpression network analyses are qualitatively similar irrespective of the measure of profile similarity employed. Results of analyses based on the other measures of profile similarity (distance) are presented in the Supplementary Information section (see Additional file 1
) along with a discussion of the relationships among those measures.
Figure 1 Gene coexpression networks. a) The expression profile of a gene i (Gi) can be represented as a row vector with dimensions (n) equal to the number of tissues (28); all profiles taken together yield an m × n gene expression matrix (X) where m = (more ...)
The use of the PCC
to build coexpression networks is predicated on the choice of a threshold correlation coefficient (r
) at, or above which, genes are considered to be coexpressed and are thus connected by an edge in the network. As previously reported [21
], a series of increasing r
-values (0.4–0.9) was evaluated for utility in building coexpression networks. When r
-values << 0.7 are used, coexpression networks tend to congeal into graphs that are so densely connected as to preclude meaningful analysis of their topological properties. On the other hand, r
-value thresholds ranging from 0.7–0.9 yield analytically tractable networks and qualitatively similar results. Results for coexpression networks based on an r
-value threshold of 0.7 are reported here since this cutoff gives networks that are unlikely to contain many spurious edges but are sufficiently large and dense for robust topological analysis. For the 28-dimensional gene expression profiles evaluated here, an r
-value of 0.7 corresponds to a highly statistically significant correlation (P
= 3.4e-5). Furthermore, gene expression profiles with r
≥ 0.7 can be visually appreciated to be highly similar (Figure ).
Human and mouse coexpression networks were evaluated with respect to a number of parameters describing their global topological properties and found to be highly similar (Table ). The numbers of nodes and edges in each network are comparable, with the mouse network showing slighter higher values for both. The average degree (<k>) is the average number of edges per node and gives rough approximation of how dense the network is. The mouse network shows a slightly higher <k> which is consistent with the greater numbers of nodes and edges. However, <k> is again similar for both networks and rather high. By way of comparison, typical world-wide-web networks have <k>≈7. The values of <k> might not be particularly relevant because, as will be shown below, the degree distributions are highly skewed.
Global characteristics of the coexpression networks
A more refined notion of network density is given by the average clustering coefficient (<C>). The clustering coefficient C of a node i is defined as the fraction of the pairs of neighbors of node i that are linked to each other: Ci = 2ni/ki(ki-1), where ni is the number of observed links connecting the ki neighbors of node i and ki(ki-1)/2 is the total number of possible links. The average clustering coefficient (<C>) is the mean of this value for all nodes with at least two neighbors, and for both the human and mouse networks <C>≈0.4 (Table ). For networks of this size, these <C> values are considered to be quite high. By way of comparison, for randomly generated networks with the same number of edges and same degree (k) sequences, the expected <C> is estimated to be 0.0643 for human and 0.0529 for mouse. The high density of the coexpression networks is not necessarily surprising because, as one could reasonably expect, co-expression is, largely (but not entirely), transitive. In other words, if gene A is coexpressed with genes B and C, then genes B and C are likely to be coexpressed as well. However, the high observed values of <C> for the human and mouse networks do not appear to be due to the transitivity of the PCC similarity measure alone. This is demonstrated by the observation that networks built using the PCC measures between randomly permuted gene expression profiles, thus preserving some transitivity, also have values of <C> that are far lower than the observed values: human = 0.0933, mouse = 0.1229.
The average path length (<l>) is the average shortest path, or the smallest number of edges needed to connect two nodes, between any two reachable nodes in the network. Clearly, the co-expression networks exhibit "small world phenomena": on average, any two nodes are separated by only a few edges (Table ).
Node degree (k
) distributions were also computed for the human and mouse coexpression networks (Figure and ). In both cases, the distribution seems to follow a power-law, that is, the probability that a randomly chosen node has degree k
, is Pr [K
where the parameter α
, is the exponent of the power law distribution. While the degree distributions seem to be well approximated by a straight line in log-log scale (α = 1.13 for the human network and α = 1.11 for the mouse network by the least squares method), there appears to be an exponential drop-off in the tail of the distributions. Thus, the distributions are more appropriately described as a fat-tailed, power-law-like distributions rather than strict power-laws. Accordingly, evolutionary models that lead to pure power-laws, typically, with α >2, such as preferential attachment, would not apply to the evolution of this network. Additional details on these distributions are provided in the Supplementary Information (see Additional file 1
). Node degree distributions obtained using different distance (similarity) measures show similar fat-tailed properties and appear to be better approximated by power-laws than those obtained using the PCC
(see Additional file 1
; Supplementary Figure 3).
Node degree (k) distributions for human and mouse gene coexpression networks. All distributions are plotted in log10-log10 scale. Frequency distributions showing f(k) × k for human (a) and mouse (b).
It has been shown that analysis of the plot of the clustering coefficient C(k)
as a function of their degree ki
can yield insight to the structure of the network. In particular, it has been reported that the C(k)
distribution of networks with hierarchical structure follows a power-law, with a high interconnectivity among nodes of low degree that decreases as the degree increases [18
]. The C(k)
distribution of the human and mouse coexpression networks is more or less constant (Figures and ) implying that these networks, most likely, do not exhibit a hierarchical structure. Slightly different trends were observed for the different distance or similarity measures (see Additional file 1
; Supplementary Figure 4).
Figure 3 Clustering coefficient against node degree C(k) distributions for human (a) and mouse (b) gene coexpression networks. The degree (k) is shown on the x-axis and the average clustering coefficient <C> for all nodes with degree k is shown (more ...)
Human-mouse intersection network
As described above, the human and mouse gene coexpression networks are closely similar in terms of their global topological characteristics; they share similar node degree (k) distributions and C(k) distributions as well as similar average node degrees (<k>), clustering coefficients (<C>) and path lengths (<l>). We further sought to evaluate the similarity between the species-specific coexpression networks at a local level. There is as yet no general method for assessing local network similarity (or graph isomorphism). However, in the case of the human and mouse coexpression networks generated here, the use of orthologous gene pairs results in a one-to-one mapping between the nodes of the two networks. In this sense, the networks can be considered to be defined over the same set of nodes N, and thus can be directly compared by generating an intersection network. The human-mouse intersection network is defined as the network over the set of nodes N where there is a link between two nodes i and j if i and j denote two pairs of orthologous genes which are connected in both the human and the mouse networks (Figure ). Thus, the intersection network captures the coexpressed gene pairs conserved between human and mouse.
Figure 4 Human-mouse conserved intersection network. a) Procedure for computing the intersection network whereby conserved edges that link the corresponding orthologous genes in both species are preserved. b) Node degree (k) and c) clustering coefficient against (more ...)
The global characteristics of the intersection network are shown in Figures and . The intersection network node degree and C(k)
distributions are clearly similar to those of the species-specific networks as are the average clustering coefficient (<C
> = 0.4006) and average path length (<l
> = 6.89). The exponent that best approximates the power law of the node degree distribution is α
= 1.34 when a line is fitted to the logarithmically binned distribution (see Additional file 1
; Supplementary Figure 5) and α
= 1.01 using the maximum likelihood method. Taken together, these findings indicate that the global structure of the species-specific coexpression networks is preserved in the intersection network. However, the most striking feature of the intersection network is the small fraction of genes (~29–31%) and edges (~7–8%) that are conserved between the human and mouse networks (Table ). Accordingly, the average node degree is far lower (<k
> = 11.57) in the intersection network than it is in each of the species-specific networks.
Local conservation of the human-mouse intersection network
Several other factors also point to the local level divergence of the human and mouse coexpression networks. When the degrees (k) of nodes present in both the human and mouse networks were arranged into species-specific degree sequence vectors, only relatively low, albeit statistically significant (given the large number of observations), correlation (r = 0.27, P = 9e-149) was seen between species. In other words, a highly connected node (hub) in the human coexpression network is not especially likely to be a hub in the mouse coexpression network and vice versa. In addition, the human and mouse coexpression r-values for shared edges are not correlated at all (r = 0.03). Finally, there is no correlation between the principal eigenvector values of the human and mouse networks (r = -0.03), indicating that the dense areas of the networks do not overlap. Thus, whereas the global topological properties of the species-specific networks are highly conserved, the local architectures that underlie these topologies, in terms of the identities of the coexpressed genes pairs, are highly divergent.
The low level of conservation seen for the local network structures was unexpected, particularly, in light of the close similarity of the global topological properties, and suggested substantial divergence of gene expression patterns between human and mouse orthologs. A series of controls were implemented to assess the meaning and robustness of these findings (see Additional file 1
). These controls included comparison of networks constructed separately from experimental and biological replicate data sets, and analysis of network conservation for subsets of the data with different experimental variances. The results of these controls indicate that the majority of the local divergence between human and mouse coexpression networks does not result from experimental noise. In addition, lowering the PCC
threshold used to define edges in the coexpression networks does not result in a substantial increase in the fraction of edges conserved between species (see Additional file 1
; Supplementary Figure 9a).
The high divergence of coexpressed gene pairs between human and mouse detected here is consistent with previous studies that have shown substantial divergence of the expression profiles for human and mouse orthologs [5
]. Indeed, when the expression profiles were directly compared for the 9,105 human-mouse orthologous gene pairs studied here, the average PCC
, while positive, was fairly low and not statistically significant (average PCC
= 0.22, Student's t = 1.15, df = 26, P
Functional coherence of gene coexpression networks
The coexpression networks described here are analytical constructs that are intended to capture the complexity of the relationships among thousands of gene expression patterns. Given the significant rapidly evolving (and perhaps neutral) component in the evolution of these networks, it is not a trivial question whether or not (and to what extent) coexpressed gene pairs represent coregulated and/or functionally interacting genes. To assess the biological relevance of these networks, Gene Ontology (GO) functional annotations were mapped onto the network nodes and the functional affinities of linked genes were explored. The first question addressed was whether, and to what extent, coexpressed genes are functionally related. The structure of the GO graph can be exploited to derive measures of functional similarity between pairs of genes [23
]. Pairwise similarities between biological process terms were computed for all pairs of network genes associated with GO annotations, and these functional similarity data were then used to cluster genes by the UPGMA method. The resulting lists of genes, ordered by function, were plotted on both axes of a matrix containing all pairwise gene expression profile correlations. When these correlations (r
-values) are color coded, it allows for a visual inspection of the functional relationships, or lack thereof, among coexpressed genes (Figure ). The functional coherence of the human-specific, mouse-specific, and intersection networks is clearly revealed by the off-diagonal block color-structure of plots (Figure , , and , respectively). In each of these networks, there are numerous clusters of functionally related genes that are demonstrably enriched for coexpressed pairs. For comparison, the inset of each plot shows a negative control with genes ordered randomly along the matrix axes, and accordingly, no apparent block color-structure for the correlation values.
Figure 5 GO similarity versus gene profile correlation matrix. Genes are plotted along both axes of the matrices. Genes were clustered according to the pairwise similarity between their GO biological process annotation terms for the a) human-specific coexpression (more ...)
In addition to this visual evidence for the functional affinity of coexpressed gene pairs, genes linked in the coexpression networks were found to have significantly higher GO similarities, on average, than seen for all pairs of genes (Table ). In addition, statistically significant positive correlations were detected between the pairwise coexpression r-values and GO similarity values for all three coexpression networks, indicating that more tightly coexpressed gene pairs tend to be more functionally related (Table ). The correlation was significantly greater for the intersection network than for each of the species-specific networks.
Average GO similarity for mammalian gene coexpression networks versus average GO similarity for all gene pairs
Correlation (r) between pairwise GO similarity and pairwise gene expression profile r-values
Based on visual comparison of the off-diagonal color structure of the plots shown in Figure versus Figure , there appears to be a stronger relationship between function and coexpression for the genes that are found in the conserved human-mouse intersection network than for the human or mouse networks. This suggests that the expression patterns of gene pairs that are tightly functionally coupled are more prominently constrained by purifying selection than those of more loosely functionally associated genes. Statistical comparison of the species-specific versus intersection network supports this interpretation. Pairs of coexpressed genes in the intersection network are significantly more functionally similar, on average, than pairs of coexpressed genes in the species-specific networks (Table ). A cumulative frequency distribution of GO similarities between pairs of mammalian genes clearly shows that genes linked in the intersection network are more functionally similar than genes linked in the species-specific portions of the network, which are in turn more similar than all pairs of genes irrespective of their expression patterns (Figure ). Finally, there was a significantly stronger dependence between coexpression and function for gene pairs in the intersection network compared to the species-specific pairs as indicated by a comparison between expression profile correlations and GO functional similarity (Table ).
Average GO similarity for species-specific mammalian gene coexpression networks versus average GO similarity for the conserved human-mouse intersection network
Figure 6 GO biological process semantic similarity cumulative frequency distributions. Distributions [Pr(X≤x)] of GO term similarities are shown for all human and mouse gene pairs, for pairs of genes linked in the species-specific coexpression networks (more ...)
Correlation (r) between pairwise GO similarity and pairwise gene expression profile r-values
Network clusters and biological function
The mammalian coexpression networks analyzed here are tightly clustered, as indicated by the high average clustering coefficients (Table ), and display modular, albeit not necessarily hierarchical, structure (see Additional file 1
; Supplementary Figure 6), (Figure and Figure ). In light of the presence of compact network substructures, further functional interrogation was performed by decomposing the networks into tightly linked clusters of genes. The genes in these clusters were then evaluated for the presence of statistically overrepresented GO terms, which would indicate functional coherence for the respective group of genes. In a number of cases, there are striking relationships between network substructure, gene function and coexpression. A detailed table showing resolved network clusters, overrepresented GO terms and gene ids along with their expression patterns is presented online [24
]. Two of the most prevalent functional classes that show clear function-expression coherence are genes involved in sexual reproduction and host immune response. Examples of two such clusters are shown in Figure . This observation is notable because genes of these two functional classes are also prone to evolve under the influence of positive, diversifying selection [25
]. This is thought to be due to sexual selection, in the case of reproduction related genes [27
], and to evolutionary arms race between hosts and their pathogens for immune response genes [28
]. It might prove to be the case that changes in gene expression patterns for such genes also have pronounced evolutionary significance.
Clusters of tightly coexpressed and functionally coherent genes. Examples of clusters involved in spermatogenesis (a) and host immune response (b) are shown along with their tissue-specific expression patterns.
Consistent with the apparent increased functional coherence of the intersection network, the correspondence between network clusters, GO term overrepresentation and expression patterns is significantly more pronounced for the conserved intersection network than for the human and mouse species-specific networks. Thus, 38% of clustered genes from the intersection network mapped to overrepresented GO terms compared to 13% of human-specific (χ2 = 85.1, P = 2.8e-20) and 18% of mouse-specific network genes (χ2 = 43.0, P = 5.5e-11).