Co-expression and co-regulation
We downloaded all publicly available microarray data for S. cerevisiae from the Stanford Microarray Database [11]. We then downloaded the transcription factor binding data of Lee et al. from their website [13]. We restricted our analysis to those genes in the microarray data set that had at least one transcription factor bound upstream of the gene according to the binding data. This left us with data on 2284 genes across 611 arrays.
To measure similarity in expression, we calculated the pairwise correlation coefficient between the mRNA expression profiles of all genes in our data set. Initially, there were 2,607,186 gene pairs in our analysis. If both genes in a pair shared the same promoter region, we excluded the pair from the analysis, which left 2,606,473 gene pairs. For each pair of genes we also determined whether a common transcription factor bound to the promoter region of both genes. Figure 1 shows the observed fraction of gene pairs that share a common transcription factor binder as a function of the correlation between the expression profiles of the two genes. The figure demonstrates that genes with strong, positively correlated expression profiles are much more likely to be bound by a common transcription factor than genes with less strongly correlated expression profiles. This effect is present, however, only at relatively high correlation coefficients. For two genes to have a 50% chance of sharing a common regulator, the correlation between their expression profiles must reach 0.84. There were 168,994 pairs of genes that had a common transcription factor binder, but only 5,419 (3.2%) of these pairs had a correlation greater than 0.84.
Figure 1. Fraction of Gene Pairs Sharing a Common Transcription Factor Binder vs. Correlation Coefficient. In this figure each point represents a ratio: the denominator is the number of gene pairs where correlation in expression profile is between x and ...
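The binning behind this kind of plot can be sketched as follows. This is a minimal illustration and not the code used in the study: the dictionary layouts (`profiles`, `binders`, `promoters`), the function names and the bin width parameter are all assumptions made for the example, and profiles are assumed to be non-constant so that the correlation is defined.

```python
from itertools import combinations
from math import floor, sqrt


def n_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def shared_tf_fraction_by_bin(profiles, binders, promoters=None, bin_width=0.01):
    """Return {bin index: fraction of gene pairs sharing a TF binder}.

    profiles:  gene -> list of expression values (hypothetical layout)
    binders:   gene -> set of transcription factors bound upstream
    promoters: optional gene -> promoter region id; pairs that share a
               promoter region are skipped, as in the text
    A pair with correlation r falls in bin floor(r / bin_width).
    """
    counts = {}  # bin index -> [pairs sharing a TF, all pairs]
    for g1, g2 in combinations(sorted(profiles), 2):
        if promoters is not None and promoters[g1] == promoters[g2]:
            continue
        r = pearson(profiles[g1], profiles[g2])
        tally = counts.setdefault(floor(r / bin_width), [0, 0])
        tally[1] += 1
        if binders[g1] & binders[g2]:
            tally[0] += 1
    return {b: shared / total for b, (shared, total) in counts.items()}
```

As a check on the arithmetic, `n_pairs(2284)` gives 2,607,186, the initial number of gene pairs reported above.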
As a control, we randomly permuted the mapping between genes and their promoter regions. As Figure 1 shows, when the analysis described above was repeated using this permuted data, there was no relationship between correlation in expression profile and shared transcription factor binding. We also performed two additional permutation tests. For the first, we permuted the relationship between transcription factors and their target genes. For the second, we randomly permuted the expression profiles for each gene. No relationship between correlation in expression profile and shared transcription factor binding was seen when either of these additional permutations of the data was used in the analysis (data not shown). These controls provide evidence that the relationship between correlation and shared transcription factor binding observed in actual yeast data is extremely unlikely to be due to chance alone. Instead, it most likely reflects the co-regulatory mechanisms that exist in yeast.
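One way to picture the first control is a shuffle of the sets of bound transcription factors among genes, which keeps every set intact but breaks its link to a gene's expression profile. The sketch below assumes a hypothetical `binders` dictionary (gene to set of bound transcription factors); the function name and the `seed` parameter are illustrative and do not come from the study.

```python
import random


def permute_gene_promoter_map(binders, seed=None):
    """Return a copy of `binders` with the TF sets shuffled among genes.

    binders: gene -> set of transcription factors bound at its promoter
             (hypothetical layout)
    seed:    optional seed so that a permutation can be reproduced
    """
    rng = random.Random(seed)
    genes = sorted(binders)
    tf_sets = [binders[g] for g in genes]
    rng.shuffle(tf_sets)  # in-place shuffle of the promoter assignments
    return dict(zip(genes, tf_sets))
```

Repeating the correlation analysis on the returned mapping gives a baseline in which any remaining association can only arise by chance.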
We also defined a measure of regulatory closeness, c(X,Y), which was designed to capture more distant regulatory relationships between genes. c(X,Y) is inversely proportional to the path length between genes X and Y in a graph where nodes represent genes and edges are drawn between each transcription factor and the genes to which it binds (see Methods). If two genes (X and Y) are regulated by two different transcription factors, but both of those transcription factors are regulated by the same transcription factor, then c(X,Y) will be high. Figure 2 shows a plot of c(X,Y) versus correlation coefficient. Like Figure 1, this graph suggests that co-expressed genes are likely to share common regulatory mechanisms. When this broader measure of co-regulation is used, the effect is apparent at less extreme correlation coefficients and also appears to be present for strong, negative correlations. A similar relationship is not seen when permuted data is used in the analysis. Note that when the mapping between genes and their promoter regions is permuted, the natural regulatory network of transcription factors is perturbed and c(X,Y) tends to be smaller for any given correlation in expression profiles. The same effect is seen when the relationship between transcription factors and their target genes is permuted.
Figure 2. Regulatory Closeness vs. Correlation Coefficient. Each point shows the mean regulatory closeness for gene pairs whose correlation in expression profile is between x and x+.01. The actual yeast data is represented by blue ● and the control data ...
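The exact definition of c(X,Y) is given in Methods and is not reproduced here. As a rough sketch of the idea only, the code below assumes c(X,Y) = 1/d, where d is the shortest path length (in edges) between X and Y in an undirected graph linking each transcription factor to its targets, and assigns 0 to unconnected pairs. The proportionality constant, the treatment of edge direction and all names are assumptions made for illustration.

```python
from collections import deque


def build_graph(tf_targets):
    """Undirected adjacency sets from a dict TF -> iterable of bound genes."""
    adj = {}
    for tf, targets in tf_targets.items():
        for gene in targets:
            adj.setdefault(tf, set()).add(gene)
            adj.setdefault(gene, set()).add(tf)
    return adj


def path_length(adj, start, goal):
    """Shortest path length in edges (breadth-first), or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closeness(adj, x, y):
    """Assumed closeness 1/d; 0.0 for unconnected pairs or for x == y."""
    dist = path_length(adj, x, y)
    if dist is None or dist == 0:
        return 0.0
    return 1.0 / dist
```

Under these assumptions, two genes bound by the same transcription factor are two edges apart (closeness 0.5), while the case described above, two genes whose different regulators share a common regulator, is four edges apart (closeness 0.25).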
The above analysis included all genes in the data set, regardless of whether the genes themselves were transcription factors. We also looked specifically at the situation where one gene in the pair was a transcription factor. Ideker et al. published examples of cases where the mRNA expression profiles of a transcription factor and the genes it regulated were strongly correlated [6]. In our analysis, we did not observe that the correlation between two genes tended to be higher than average if one of the genes was a transcription factor that regulated the other gene. In fact, a strong correlation between a gene X and transcription factor A was more likely to be explained by the presence of a transcription factor B that bound both A and X (data not shown).
Our analysis used data from 611 microarrays covering a wide variety of experimental conditions. In many cases, however, researchers have much smaller and more limited data sets. To investigate how the link between co-expression and co-regulation depends on the number and type of experiments in the data set, we repeated the above analyses using only selected subsets of the microarray data. We created subsets by randomly selecting experiments and also by choosing experiments related to cell cycle, starvation or the stress response. We chose to examine cell-cycle, starvation and stress-related experiments because they provided relatively large data sets and because they have been well studied in the past [14]. By choosing different threshold correlation coefficients above which gene pairs are predicted to have a shared transcription factor binder, different levels of sensitivity and specificity can be obtained. For each data subset, we chose a threshold correlation coefficient r such that 75% of all gene pairs whose correlation in expression profile was greater than r had a shared transcription factor binder. We then determined what percentage of the total number of gene pairs sharing a common transcription factor had a correlation in expression profile greater than r. In other words, we evaluated the sensitivity of expression data for detecting shared transcription factor binding when positive predictive value was held constant at 75%.
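The threshold selection can be illustrated with a short sketch. Because positive predictive value need not change monotonically with the threshold, more than one cut-off may satisfy the 75% criterion; the code below assumes the lowest qualifying cut-off is wanted (the one that maximizes sensitivity) and treats pairs at or above the cut-off as predicted positives. Both choices, and the input layout, are assumptions and not details taken from the study.

```python
def threshold_for_ppv(pairs, target_ppv=0.75):
    """Find the lowest correlation cut-off that reaches the target PPV.

    pairs: list of (correlation, shares_tf) tuples, one per gene pair
           (hypothetical layout; shares_tf is a bool)
    Returns (cut-off, sensitivity), or None if no cut-off qualifies.
    """
    total_positive = sum(1 for _, shared in pairs if shared)
    if total_positive == 0:
        return None
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    best = None
    true_positive = 0
    for i, (r, shared) in enumerate(ranked, start=1):
        true_positive += shared
        # Evaluate only where the correlation value changes, so that
        # tied pairs are always included or excluded together.
        if i < len(ranked) and ranked[i][0] == r:
            continue
        if true_positive / i >= target_ppv:
            best = (r, true_positive / total_positive)
    return best
```

Applying the same function to each subset of arrays would give one (r, sensitivity) point per subset, which is the kind of comparison described in this paragraph.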
Figure 3 shows how r and sensitivity vary as a function of the number of microarray experiments in the mRNA expression data. The analysis illustrates several important points. First, as the number of microarrays in a data set decreases, a higher threshold correlation coefficient is required to achieve a given positive predictive value. Second, performance in finding pairs of genes with shared transcription factors initially improves as the number of microarray experiments increases, but this effect levels off after approximately 100 microarrays are used. Third, randomly selected data sets composed of arrays spanning a diversity of experimental conditions are more informative overall than similarly sized data sets relating to a single experimental condition.
Figure 3. Threshold Correlation Coefficient and Sensitivity vs. Number of Microarrays. Labeled data points represent selected subsets as follows: A, starvation; B, cell cycle; C, stress response; D, all data. All other data points, ...
Functional similarity vs. co-regulation
One major difficulty in testing the hypothesis that functionally similar genes are likely to be co-regulated is the problem of defining functional similarity. There is no commonly accepted methodology for doing this. Here we used the Gene Ontology (GO) classification system [17] to define two measures of functional similarity. For our first measure, which we termed the minimum node count method, we determined for each GO term how many genes in our data set were annotated with that GO term. We termed this the node count for a GO term. For our purposes, a gene was considered to be annotated with a given GO term if it was directly annotated with the term or if it was annotated with a descendant of that term. Then, for each pair of genes, we found all GO terms that were annotated to both genes in the pair and determined the node count for each of these terms. We took the smallest node count as our measure of functional similarity. Small minimum node counts thus represented significant functional similarity, while gene pairs with larger minimum node counts were less similar functionally. As examples, the GO term "microtubule binding" had a node count of 5, while the GO term "binding" had a node count of 443.
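The minimum node count method described above can be sketched in a few steps: propagate each gene's direct annotations to all ancestor terms, count genes per term, then take the smallest count among the terms a pair shares. The data layouts (`parents`, `direct`) and function names below are hypothetical, and the ontology is treated as a simple parent map without distinguishing relationship types.

```python
def ancestors(term, parents):
    """Return `term` together with all of its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(direct, parents):
    """gene -> set of terms, including ancestors of each direct annotation."""
    return {
        gene: set().union(*(ancestors(t, parents) for t in terms))
        for gene, terms in direct.items()
    }


def node_counts(full):
    """term -> number of genes annotated with it (directly or via descendants)."""
    counts = {}
    for terms in full.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return counts


def min_node_count(g1, g2, full, counts):
    """Smallest node count among terms shared by both genes, or None."""
    shared = full[g1] & full[g2]
    return min((counts[t] for t in shared), default=None)
```

A smaller return value indicates a more specific shared term and therefore, under this measure, greater functional similarity.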
For our second measure of functional similarity, we defined GO term levels. Level 1 GO terms were defined as terms whose distance from the root of the ontology was 1 (for example, direct child terms of "molecular function"). Using this method, "microtubule binding" is a level 5 GO term, while "binding" is a level 1 GO term. Two genes were considered to have matching level 1 GO terms if each was annotated with a GO term that was a descendant of the same level 1 term. Levels 2 through 8 were defined similarly and represent increasingly similar levels of function.
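A sketch of the level-based measure follows. Since a GO term can be reached from the root by more than one path, the code assumes that a term's level is its shortest distance from the root; that choice, the `children` layout and the function names are assumptions for illustration. Gene annotations are assumed to have already been propagated to ancestor terms, so two genes match at level k whenever they share a term of level k.

```python
from collections import deque


def term_levels(children, root):
    """term -> shortest distance from `root`, found breadth-first."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in levels:
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


def deepest_matching_level(terms1, terms2, levels):
    """Highest level at which two propagated annotation sets share a term.

    Returns 0 when the genes share only the root or nothing at all.
    """
    shared = terms1 & terms2
    return max((levels[t] for t in shared if t in levels), default=0)
```

A pair whose deepest matching level is k also matches at every level below k, which is consistent with the number of matching pairs shrinking as the level increases.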
Figure 4 shows the fraction of gene pairs that share a common transcription factor binder as a function of minimum node count. Figure 5 is similar, but uses GO term levels instead of minimum node count as the measure of functional similarity. Both of these measures are only proxies for true functional similarity. Although small node counts generally reflect significant functional similarity, the minimum node count method is misleading when large numbers of functionally similar genes are annotated with the same GO term. The ribosomal genes are an example of this, and these genes account for many of the outliers in Figure 4. Functional similarity as defined by GO term level is also problematic in that it is highly dependent on the complexity and depth of annotation in each ontology. Figure 5 appears to suggest that the molecular function ontology is more useful for identifying co-regulated genes than the biological process ontology. However, this is likely because higher-level GO terms in the biological process ontology are generally broader and more inclusive than terms at the corresponding level of the molecular function ontology (as is suggested by the number of gene pairs that match at each level in the two ontologies). As an example, consider the level 1 GO term "cellular process" from the biological process ontology. Knowing that two genes are both involved in a "cellular process" is unlikely to provide much information about the regulation of the two genes. Nevertheless, despite the inherent flaws in each of these measures of functional similarity, use of either measure shows that genes with similar functional annotations tend to share common regulatory mechanisms.
Figure 4. Co-regulation vs. Functional Similarity as Defined by Minimum Node Count. A, B and C were created using the biological process, molecular function and cellular component ontologies, respectively. The dark black lines are fit using the actual yeast data ...
Figure 5. Co-regulation vs. Functional Similarity as Defined by Level of GO Term Match. BP, MF and CC refer to the biological process, molecular function and cellular component ontologies, respectively. A) The number of gene pairs which have matching GO terms decreases ...
The above analysis, which shows an association between functional annotation and regulatory mechanism, suggests that functional annotations can be useful in identifying co-regulated genes. This conclusion is confounded by the fact that knowledge of regulatory mechanisms may (and almost certainly does, to at least some extent) influence functional annotation. If functional annotation were solely a reflection of existing knowledge about regulatory mechanisms, we would still observe an association between functional annotation and regulatory mechanism even though functional annotation would not be useful for identifying novel co-regulated genes. One limitation of our study is that we are not able to exclude this effect. Doing so would require an assessment of gene function that is unbiased by existing knowledge of regulatory mechanisms. As far as we know, no such assessment exists. We suspect, however, that the functional annotations from Gene Ontology do contain a significant amount of other information not based on prior knowledge of regulatory mechanisms. To the extent that this is true, the observed association between functional annotation and regulatory mechanism indicates the utility of functional annotations in identifying co-regulated genes.
Using co-expression and functional similarity to predict co-regulation
Figures 6 and 7 show how likely two genes are to be co-regulated, as estimated by the fraction of gene pairs sharing a common transcription factor binder, when correlation in expression profile and functional similarity are both taken into account. Both figures suggest that measures of similarity in expression and function provide independent information about similarity in regulatory mechanism. Functional annotations are particularly helpful in identifying co-regulated gene pairs when there is an intermediate level of correlation (i.e. 0.5 < r < 0.8) between expression profiles. Genes with this level of pairwise correlation in expression are likely to share a common transcription factor binder only if they have similar functional annotations. As discussed in the previous section, our study may overestimate the utility of functional annotation to the extent that functional annotations merely reflect existing knowledge of regulatory mechanisms.
Figure 6. Co-regulation vs. Co-expression and Functional Similarity as Defined by Minimum Node Count. A, B and C were created using the biological process, molecular function and cellular component ontologies, respectively. For each of the three ontologies, the ...
Figure 7. Co-regulation vs. Co-expression and Functional Similarity as Defined by Level of GO Term Match. A, B and C were created using the biological process, molecular function and cellular component ontologies, respectively. For each of the three ontologies, ...