Several studies (Hamilton[1]) have found stronger correlation between the expression levels of genes that are located close to each other on the genome than between those of distant genes: when the expression levels of many genes are measured in multiple tissue samples, for example using microarray technology, adjacent genes are sometimes found to be consistently up- or downregulated in a subset of the tissue samples.
Gene expression is influenced by many factors (for a review, see Orphanides[3]), many of which could influence the correlation between the expression of two genes in general, and that between two adjacent genes in particular. Of particular interest are chromatin domains. DNA can exist in one of two states: a condensed state, termed heterochromatin, which is broadly inaccessible to transcription (although there are exceptions; Orphanides[3]), and an active state, termed euchromatin. A chromatin domain (a segment of DNA which, in a given cell at a given moment, is either entirely euchromatin or entirely heterochromatin) typically spans several genes (Roy[4]). Therefore, one would expect the expression levels of two adjacent genes to be positively correlated, at least if it were possible to measure transcription in individual cells. If the chromatin state were completely random (Jackson[5] suggested a dynamic equilibrium, in which chromatin fluctuates, to some extent randomly, between the two states), the effect of chromatin domains would vanish when gene expression is measured in pools of many cells, as with microarray technology. However, there is ample evidence for non-randomness. For example, chromatin states tend to be preserved after cell division (Orphanides[3]), and Cho[6] demonstrated that the states of chromatin domains in yeast are related to the cell cycle.
In addition to the chromatin theory, several other explanations have been suggested for the apparent correlation between the expression levels of adjacent genes. Several authors (Cohen[7]) have noted that divergent gene pairs show stronger correlation than tandem and convergent pairs, possibly because divergent pairs share an upstream activation sequence. Lercher[9] found that many of the co-expressed adjacent genes in Caenorhabditis elegans are either operons or homologues (see also Llorente[10] and Ross[11]), and it has been suggested that evolution has arranged for functionally related genes to be located close to each other, either to promote consistent inheritance (Bleiweiss[12]) or to benefit from the correlation accounted for by the chromatin domains (Cohen[7] found a nonrandom distribution of the chromosomal location of genes with high expression levels in testis and ovaries in Drosophila). Jackson[5] suggested that the location of a gene in the nucleus plays a role in its transcription, in relation to gradients in the concentration of transcription factors.
Finally, since the action of a transcription factor on a promoter weakens with distance, genes belonging to the same pathway should show stronger correlation if they are located close to each other (Dorsett[14]). Given this abundance of alternative theories, a study of gene-expression correlations should be designed so that the correlation structures predicted by one model can be distinguished from those predicted by the others. The same applies to the statistical analysis techniques used.
An important consequence of the evolution-based theories is that they predict a consistent coregulation structure. Suppose that two genes (in this case, two adjacent genes) are co-regulated because, for example, they participate in the same pathway. They would then show a strong correlation, because they would be co-regulated in all tissue samples. This need not be the case with the chromatin domain model: the segments of euchromatin in one tissue sample may only partly overlap with those in another tissue sample. We call this latter scenario an inconsistent coregulation structure. With consistent coregulation, adjacent gene pairs will either show strong positive correlation or be uncorrelated. With inconsistent coregulation, all adjacent gene pairs will show a modest positive correlation.
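The contrast between the two structures can be illustrated with a small simulation. The following is a minimal sketch, not part of the analysis in this paper: the function name `simulate_pair_correlations`, the sample size of 80, the 20% share of co-regulated pairs and the correlation values 0.8 and 0.16 are all illustrative assumptions, chosen so that both scenarios have the same average correlation.

```python
import numpy as np


def simulate_pair_correlations(n_pairs, n_samples, rho, rng):
    """Sample Pearson R for n_pairs gene pairs.

    rho is the true correlation of each pair (a scalar or an array of
    length n_pairs). All names and values here are hypothetical.
    """
    rho = np.broadcast_to(np.asarray(rho, dtype=float), (n_pairs,))
    x = rng.standard_normal((n_pairs, n_samples))
    noise = rng.standard_normal((n_pairs, n_samples))
    # y has unit variance and correlation rho with x
    y = rho[:, None] * x + np.sqrt(1.0 - rho[:, None] ** 2) * noise
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt(
        (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
    )


rng = np.random.default_rng(0)
n_pairs, n_samples = 5000, 80

# Consistent coregulation: a minority of pairs is strongly correlated,
# the remaining pairs are uncorrelated.
coregulated = rng.random(n_pairs) < 0.2
r_consistent = simulate_pair_correlations(
    n_pairs, n_samples, np.where(coregulated, 0.8, 0.0), rng
)

# Inconsistent coregulation: every pair has the same modest correlation.
r_inconsistent = simulate_pair_correlations(n_pairs, n_samples, 0.16, rng)

for name, r in [("consistent", r_consistent), ("inconsistent", r_inconsistent)]:
    print(f"{name:>12}: mean R = {r.mean():.3f}, SD of R = {r.std():.3f}")
```

Under these assumptions the two scenarios have similar mean correlations but differ in the spread of R across pairs, which is the kind of difference a method for separating them would have to exploit.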
In a microarray analysis of gene expression in 35 pools of Drosophila embryos and 54 adult Drosophila (Spellman and Rubin[15], reviewed by Oliver[16]), it was shown that adjacent genes with correlated expression levels tend to cluster. The method used to demonstrate this was the following: let w be a fixed window size, e.g. 10. For each window of w adjacent genes, the average pairwise Pearson correlation coefficient within the window was computed. If that measure was found to be significant at, say, 1 - α = 0.999 (the p-value was estimated in a permutation experiment), all the genes in the window were tagged. After doing this for all windows (which were allowed to overlap), the total number of tagged genes was counted. The experiment was then repeated with shuffled genes (i.e., as the data would behave in the absence of positionally related correlation), and the number of tagged genes in the shuffled experiment was subtracted from the number of tagged genes in the original experiment. This difference (called "net genes") grows with window size and starts to plateau at a window size of approximately 10. Spellman and Rubin interpreted this as evidence for gene interaction within regions of approximately that size.
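The windowing procedure described above can be sketched as follows. This is a simplified reading of the description, not Spellman and Rubin's actual implementation: the function names, the way the permutation threshold is pooled over `n_perm` shuffles, and the default values are assumptions made for illustration. The input `expr` is assumed to be a genes-by-samples matrix with rows ordered by chromosomal position.

```python
import numpy as np


def window_mean_correlation(expr, w):
    """Average pairwise Pearson R within each window of w adjacent genes."""
    corr = np.corrcoef(expr)  # full gene-by-gene matrix; fine for small examples
    upper = np.triu_indices(w, k=1)  # each pair within a window counted once
    n_windows = expr.shape[0] - w + 1
    return np.array(
        [corr[i:i + w, i:i + w][upper].mean() for i in range(n_windows)]
    )


def count_tagged(expr, w, threshold):
    """Number of genes lying in at least one window scoring above threshold."""
    scores = window_mean_correlation(expr, w)
    tagged = np.zeros(expr.shape[0], dtype=bool)
    for i in np.flatnonzero(scores > threshold):
        tagged[i:i + w] = True  # overlapping windows tag a gene only once
    return int(tagged.sum())


def net_genes(expr, w, alpha=0.001, n_perm=20, rng=None):
    """Tagged genes in the ordered data minus tagged genes after shuffling."""
    rng = np.random.default_rng() if rng is None else rng
    n_genes = expr.shape[0]
    # Null distribution of window scores from shuffled gene orders (assumed scheme)
    null_scores = np.concatenate(
        [window_mean_correlation(expr[rng.permutation(n_genes)], w)
         for _ in range(n_perm)]
    )
    threshold = np.quantile(null_scores, 1.0 - alpha)
    shuffled = expr[rng.permutation(n_genes)]
    return count_tagged(expr, w, threshold) - count_tagged(shuffled, w, threshold)
```

Calling `net_genes(expr, w)` for a range of window sizes gives a curve of net genes against w. Computing the full correlation matrix is wasteful for a genome-sized data set; a banded computation would be needed in practice.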
One problem with the above method of analysis is that the increasing number of "net genes" would occur even without direct interactions between genes separated by up to ten positions. As shown in figure 1, the analysis gives similar results when applied to simulated data from a normal distribution in which an autocorrelation of AC = 0.10 or 0.05 was imposed artificially. So we cannot, on the basis of the analysis described above, reject the hypothesis that the data arose from a simple first-order autocorrelation process, in which no clustering of correlated genes exists. It is true that gene pairs with high correlation form clusters: the autocorrelation of Pearson's R for adjacent genes is 0.1, with a standard error of 0.01. However, this can be explained by the fact that genes that tend to correlate strongly with other genes in general (for example because of low measurement noise) tend to correlate with both their neighbors. If one eliminates that confounder by looking at non-overlapping gene pairs only, the autocorrelation vanishes (0.01, standard error = 0.01).
Another way of showing this is by means of cross-tables. We divided the adjacent gene pairs into three groups: positively correlated pairs (R > 0.7), negatively correlated pairs (R < -0.7) and non-correlated pairs. (The threshold of 0.7 was suggested by Cohen[7].) If the correlated gene pairs were clustered, one would expect a gene pair to belong to the same group as the next gene pair more often than would happen by chance. This is indeed the case when overlapping gene pairs are considered: 627 gene pairs out of 12949 (4.8%) had an R > 0.7 while the next (overlapping) gene pair also had an R > 0.7. This is 2.22 times the number expected by chance alone. However, the same was observed when only one of the two overlapping gene pairs was a neighbor pair and the other was a random pair (if the genes were labelled ABCD...Z, a strong correlation between A and B predicted a strong correlation between B and C, but also between B and X, where X is a random gene). But when non-overlapping adjacent gene pairs were considered (say, AB versus CD), the contingency was 332 out of 12878 (2.6%), which is only 1.18 times the number expected by chance. So the apparent clustering of correlated gene pairs is mainly due to overlap rather than to adjacency.
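The overlap effect described above can be reproduced in a toy simulation. The sketch below is hypothetical and is not the simulation behind figure 1: it assumes a latent first-order autocorrelated signal along the chromosome plus gene-specific measurement noise, and it uses an exaggerated autocorrelation of 0.5 (rather than 0.10 or 0.05) and an arbitrary noise range so that the effect is visible with a few thousand genes. All function names are illustrative.

```python
import numpy as np


def simulate_expression(n_genes, n_samples, rho, rng):
    """First-order autocorrelated signal along the gene order, plus
    gene-specific measurement noise (the noise range is an assumption)."""
    signal = np.empty((n_genes, n_samples))
    signal[0] = rng.standard_normal(n_samples)
    for g in range(1, n_genes):
        innovation = rng.standard_normal(n_samples)
        signal[g] = rho * signal[g - 1] + np.sqrt(1.0 - rho ** 2) * innovation
    noise_sd = rng.uniform(0.0, 3.0, size=n_genes)  # noisy and clean genes
    return signal + noise_sd[:, None] * rng.standard_normal((n_genes, n_samples))


def adjacent_correlations(expr):
    """Pearson R between each gene and its right-hand neighbor."""
    z = expr - expr.mean(axis=1, keepdims=True)
    z /= np.sqrt((z ** 2).sum(axis=1, keepdims=True))
    return (z[:-1] * z[1:]).sum(axis=1)


def pair_autocorrelations(r):
    """Correlation of R between consecutive pairs: overlapping pairs
    (AB versus BC) and non-overlapping pairs (AB versus CD)."""
    overlapping = np.corrcoef(r[:-1], r[1:])[0, 1]
    non_overlapping = np.corrcoef(r[:-2], r[2:])[0, 1]
    return overlapping, non_overlapping


rng = np.random.default_rng(0)
expr = simulate_expression(n_genes=4000, n_samples=89, rho=0.5, rng=rng)
overlapping, non_overlapping = pair_autocorrelations(adjacent_correlations(expr))
print(f"AB vs BC: {overlapping:.3f}   AB vs CD: {non_overlapping:.3f}")
```

In this setup no clustering of correlated genes is built in, yet pairs that share a gene have related values of R, because a gene with little measurement noise correlates well with both of its neighbors.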
Figure 1. Net genes for simulated data. Number of genes that contribute to a high moving average of Pearson R in simulated data, as a function of the size of the windows used for computing the moving average. The shapes of the curves are similar to the findings ...
On the other hand, it is clear that there is some higher-order correlation structure in Spellman and Rubin's data. This can be seen by computing the average correlation coefficient for subgroups of the gene pairs, based on their physical distance (see table): it decreases much more slowly with distance than it would under a first-order process. Hence, the question remains how the correlation structure should be modelled and analyzed. In this paper, we present a method to separate
A) correlation of gene expression that can be attributed to consistent coregulation, from
B) the uniform correlation expected under a hypothesis of inconsistent coregulation.
Average correlation between gene pairs of different physical distances. The distances are minimal distances in bases.