Proximate genes are coexpressed in the zebrafish genome
In order to study the coexpression of proximate genes, we analyzed 100 expression datasets derived from Affymetrix microarray experiments. We use the Pearson correlation coefficient (R) of two genes to measure the level of their coexpression. The mean R of all the neighbouring gene pairs in our dataset is 0.07468 (with standard error 0.00424). This mean value is statistically significant (with p-value 0.0001) as it is +11.4 standard deviations from the mean R in a randomized genome. In a randomized genome with the same genes and expression values, the mean R is only 0.03086 (with standard deviation 0.00384) (Figure ). Tandem duplicated genes have identical functions and hence are often highly coexpressed. To eliminate the effects of tandem duplicates on this coexpression study, we removed all members except one in each tandem gene cluster and redid the analysis. After removal of tandem duplicates, the mean R became 0.06844 (with standard error 0.00426). It is slightly smaller than the value when all genes are included in the analysis, but still significant (with p-value 0.0001, +9.7 standard deviations from the random mean) (Figure ).
Figure 1 Distribution of 10,000 mean R values calculated from randomized genome. Each plot shows the distribution of 10,000 mean R values. Each mean R value is calculated by first randomly permuting the gene order of the genome, and then averaging the R values (more ...)
Surprisingly, the mean R
is almost identical for zebrafish and Arabidopsis
] when the gene order is randomized, which is about 0.03 with standard deviation 0.004). The underlying cause behind this is unclear. The positive mean value rather than zero could be an effect of the coexpression of the housekeeping genes as suggested in Williams and Bowles [15
]. The reason for the identical mean R
when these two genomes are randomized is probably either that the housekeeping genes show common patterns of expression in different genomes, dominating the mean R
value, or alternatively, that there is a constant bias or weak autocorrection between all genes in each microarray dataset (see Additional file 1
for detailed discussion and also [11
To further explore the coexpression of proximate genes, we partitioned the genes into non-overlapping blocks of k (3 ≤ k ≤ 20) physically adjacent genes according to their start position. For each gene block, we calculated a mean R of the coexpression values; then the mean of all the mean Rs is calculated and plotted in Figure . The degree of coexpression first decreases and then becomes stable when the block size k increases from 3 to 20. To verify the significance of the coexpression degree of the genes for each block size, we compared them to what would have been obtained if the genes had been rearranged in each of the following three ways: (1) randomly permuting the gene order over the entire genome; (2) randomly permuting the order of genes within each chromosome; and (3) randomly permuting the order of non-overlapping blocks of 3 consecutive genes. The last rearrangement is used to examine whether the coexpression degree in the larger blocks are dominated mainly by genes that are separated by only one or two genes. The analyses show that there is a significant difference in degree of coexpression between actual and randomized genome (Figure ). Finally, we remark that the start point for partitioning the genes into non-overlapping blocks has little effect on the analysis presented above because of the way we calculate the mean coexpression value of genes within a block.
Figure 2 Mean of pair-wise R values in blocks of size 3 to 20 (▲), shown with standard error. This is compared to the mean of 100 values obtained similarly, each from the same analysis after a random permutation of: (1) the gene order of the entire genome (more ...)
Coexpression of genes within GO classes
Genes are classified into different classes according to the biological processes they are involved in the GO database [22
]. We evaluated the mean R
of the coexpression value of a pair of genes within a GO class that contains 20 or less genes. There are 853 such GO classes. The mean R
is higher than 0.13, 0.30 and 0.50 in 50%, 25% and 10% of these classes, respectively (see Additional file 2
). Thus, genes within a GO class are highly coexpressed as reported in other genomes.
Distance and coexpression
In this analysis, the genes within 50 kilo-base pairs (kbp) from each other are collected, mean R values calculated, and the distance between them rounded to the nearest number in 0, 2000, 4000..., 50000. Figure plots the mean of these R values against their intergenic distance. There is a clear negative correlation between intergenic distance and degree of coexpression in both the full dataset (regression line R2 = 0.50) or after removal of tandem duplicates (regression line R2 = 0.46). The correlation is most significant for the gene pairs of distances between 10 kbp to 40 kbp (R2 = 0.79, p < 0.005 for the full dataset, R2 = 0.83, p < 0.005 for the dataset without tandem duplicates). Outside of this range, this correlation seems to be absent.
Figure 3 Mean R of gene pairs of up to 50 kbp apart. Gene pairs of up to 50 kbp apart were binned according to their intergenic distance, shown with regression lines. (A) is from the full dataset, whereas (B) from the resulting dataset after tandem duplicates (more ...)
Gene orientation and coexpression
Genes can be transcripted in two directions, denoted by (→) and (←). Thus, two genes can have: divergent transcription (← →), convergent transcription (→ ←), or parallel transcription (← ← or → →) orientation. We partitioned all the gene pairs into three groups according to transcription orientation. For each orientation class of gene pairs, we calculated the mean R value. Regardless of whether tandem duplicates are removed or not, gene pairs with parallel orientation showed the highest degree of coexpression, in both average as well as median values, while gene pairs with convergent orientation showed the lowest degree of coexpression (Table ). Kruskal-Wallis tests confirmed this effect of orientation (p = 0.0267) for the entire dataset, but the effect is insignificant after the removal of tandem duplicates (p = 0.2894). This is presumably because most tandem gene pairs have parallel orientation.
Descriptive statistics for pair-wise comparison of neighboring genes according to orientation of transcription
Positional clustering and the level of coexpression
We have seen from the analyses that a correlation exists between intergenic distance and degree of coexpression. To further the study in this direction, we examined coexpression among the genes in a positional cluster. We adopted the neighbourhood model to identify positional gene clusters and evaluated positional clustering using a method proposed by Li, Lee and Zhang [19
]. In the neighbourhood model, two genes are in a cluster if and only if there is a series of genes between them such that the distance between two adjacent genes in the series is less than a specified distance (D
The significance of a positional cluster depends on the value of D
, the number of genes it contains, and the gene density of its vicinity. In this study, we set D
to be 25K, which is one-eighth of the average distance between genes (206K for the full dataset, 213K after removal of tandem duplicates. c.f. Table ). These clusters are described in Additional file 3
Neighbouring genes within the clusters we identified tend to show a higher degree of coexpression than neighbouring pairs that are not. This observation is consistent with the observation that is mentioned above: a pair of genes within shorter intergenic distance tends to be more coexpressed. Table shows that with only one exception, the mean R values for clusters of all sizes are higher than 0.07468, the value of mean R of all neighbouring gene pairs in the whole dataset.
Mean of R values for all neighboring gene pairs found within some cluster of size d (d = 2, 3 ..., 7, > 7).
Among the identified positional clusters, there are ten highly significant clusters containing eight or more genes, listed in Table . They include hox gene clusters hoxba
and hoxca on Chr23
], olfactory receptor gene cluster or1
, cytochrome P450 aromatase
gene cluster cyp2j
, and major histocompatibility complex (Mhc
) gene cluster on Chr19
]. The other five clusters contain a mix of genes from different GO classes. Although these genes have not yet been investigated, they are likely structurally and functionally unrelated. In each of the gene clusters except for three eight-gene clusters on Chr4
, genes have high level of coexpression.
The positional clusters which contain at least 8 genes. The xxx stands for genes with unknown functions.
It is proposed that evolutionary selection organizes genes according to their biological function so that their expression can be co-ordinately regulated. To test this hypothesis, we use the GO database [22
] as a source of annotations of biological processes. The number of clusters formed by genes in some GO class is listed in Table (see Additional file 4
for details). The number is very much reduced when compared to the total number of clusters for each size. Thus, most positional gene clusters we observe are likely not composed of genes with similar biological functions. As suggested by Spellman and Rubin [10
], the above hypothesis is not supported.
Number of positional gene clusters found with intergenic distance D = 25K.
Finally, we investigated whether there is a correlation between the mean R of gene pairs and p-value for a positional cluster. With D = 25K, we considered all the pairs of the neighbouring genes in the same cluster. We divided the gene pairs into seven categories according to the p-value of the clusters to which the gene pairs belong. These seven categories correspond one-to-one to the following intervals of p-values: 0~10-6, 10-6~10-5, 10-5~10-4, 10-4~10-3, 10-3~10-2, 10-2~10-1, 10-1~1. To simplify presentation we consider (base 10 logarithm) -lg p-value instead of p-value, and use the intervals: 0~1, 1~2, 2~3, 3~4, 4~5, 5~6, > 6. We calculated the mean R of the neighbouring gene pairs in each category and observed a significant correlation between -lg p-value and the degree of coexpression of neighbouring gene pairs, either using the complete dataset (Figure ) or the dataset after tandem duplicates are removed (Figure ). This correlation is extremely significant for gene pairs that are transcripted in the parallel orientation. The mean R value is as high as 0.5088 (with standard error 0.0642) for the complete dataset and 0.3228 (with standard error 0.1330) even after tandem duplicates are removed. We also observed that at low p-value (high -lg p-value), more gene pairs in the identified clusters are transcribed in the parallel orientation, even with tandem duplicates (Table ). We examined a correlation between -lg p-value and neighbouring gene distance to find if such a high correlation can be explained with intergenic distance. No such correlation was found (Figure ).
Figure 4 Mean R of neighboring gene pairs in different -lg p-value intervals. Mean R values of neighboring gene pairs in -lg p intervals. All p-values were calculated with D = 25K. Gene pairs grouped into parallel, divergent, and convergent orientations are plotted (more ...)
Results from the same analysis as in Figure after tandem duplicates are removed.
Number of neighboring gene pairs found in clusters in different -lg p-value intervals (D = 25K).
Average distance between neighboring gene pairs in different -lg p-value intervals.