Identification and Characterization of Head-to-Head Gene Pairs in Human, Mouse, and Rat Genomes
A total of 1,262 human head-to-head gene pairs with TSSs separated by less than 1 kb were identified from 26,813 human genes according to the genomic mapping data from the National Center for Biotechnology Information (NCBI) (see Table S1, “H2Hpairs” sheet, for detailed information on each pair). The mitochondrial genome was excluded from this work because its organization is far more compact than that of the nuclear genome. Because one gene can belong to two pairs simultaneously when genes are closely arranged (Table S1, “GenesInMultiH2H” sheet), the 1,262 pairs involve a total of 2,515 genes. That is, 9.4% of human genes are organized in a head-to-head configuration. Similarly, 1,071 and 491 head-to-head pairs, corresponding to 2,130 (8.2%) and 968 (4.4%) genes, were identified from 25,841 mouse genes and 21,977 rat genes, respectively (see Tables S2 for detailed information).
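The identification step can be illustrated with a minimal sketch. It assumes that a head-to-head pair is any two genes on the same chromosome, on opposite strands, whose TSSs lie less than 1 kb apart; the `Gene` record, the function name, and the toy coordinates are hypothetical and do not reproduce the actual pipeline or the NCBI mapping format.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate (bp)

def head_to_head_pairs(genes, max_dist=1000):
    """Return (minus_gene, plus_gene, tss_distance) for opposite-strand genes
    on the same chromosome whose TSSs are less than max_dist bp apart."""
    by_chrom = {}
    for g in genes:
        by_chrom.setdefault(g.chrom, []).append(g)
    pairs = []
    for chrom_genes in by_chrom.values():
        minus = [g for g in chrom_genes if g.strand == "-"]
        plus = [g for g in chrom_genes if g.strand == "+"]
        # Quadratic scan: fine for a sketch, too slow for a whole genome.
        for m in minus:
            for p in plus:
                dist = abs(p.tss - m.tss)
                if dist < max_dist:
                    pairs.append((m.name, p.name, dist))
    return pairs

# Toy genes (hypothetical): gene B is close enough to both A and C
# to be counted in two pairs.
toy_genes = [
    Gene("A", "chr1", "-", 1000),
    Gene("B", "chr1", "+", 1150),
    Gene("C", "chr1", "-", 1900),
    Gene("D", "chr1", "+", 5000),
    Gene("E", "chr2", "+", 1100),
]
toy_pairs = head_to_head_pairs(toy_genes)
# Distinct genes involved can be fewer than 2 x the number of pairs.
genes_involved = {name for m, p, _ in toy_pairs for name in (m, p)}
```

In this toy case two pairs involve only three distinct genes, which mirrors why the 1,262 human pairs involve 2,515 rather than 2,524 genes.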
To characterize structural features of head-to-head gene organization in mammalian genomes, we determined the distributions of TSS distance for the human, mouse, and rat head-to-head gene pairs. The three species show similar distribution plots, in which the four columns representing pairs with TSS distances of 1 to 400 bp contain the majority of pairs (62.36%, 64.15%, and 55.19% for human, mouse, and rat, respectively), and the peak is always the 101- to 200-bp group (see Table S1, “DistHist” sheet, for detailed data). The markedly lower number of rat head-to-head pairs and their relatively flat distance distribution might be attributed to incomplete 5' UTR information and thus imprecise calculation of TSS distances, which is explained further in the Discussion.
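The binning behind such a distance distribution can be sketched as follows, assuming 100-bp columns (1 to 100 bp, 101 to 200 bp, and so on) as implied by the 101- to 200-bp peak group. The function name and the example distances are illustrative only.

```python
def tss_distance_histogram(distances, bin_width=100, max_dist=1000):
    """Count TSS distances in bins 1..bin_width, bin_width+1..2*bin_width, ...
    Distances outside 1..max_dist are ignored."""
    n_bins = max_dist // bin_width
    counts = [0] * n_bins
    for d in distances:
        if 1 <= d <= max_dist:
            counts[(d - 1) // bin_width] += 1
    return counts

# Hypothetical distances in bp, not taken from Table S1.
example_distances = [150, 120, 50, 350, 750, 101, 200]
example_counts = tss_distance_histogram(example_distances)
# Share of pairs in the first four columns (1 to 400 bp).
share_1_to_400 = sum(example_counts[:4]) / sum(example_counts)
```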
Distribution of TSS Distance of Head-to-Head Genes
All head-to-head gene pairs identified in this paper were mapped to the whole human genome (Figure S1). In addition, the relationship between the head-to-head pair ratio and the gene density of each chromosome was examined statistically. The pair ratio was obtained by dividing the number of genes involved in head-to-head pairs (h2h gene number) by the total gene count of a given chromosome. The Pearson correlation coefficient indicates a significant linear relationship between pair ratio and gene density at p < 0.05, contradicting the previous report based on data from Chromosomes 21 and 22 [9]. A significant linear relationship was also observed in the mouse genome (see Table S2, “DistHist” sheet).
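A minimal sketch of the pair-ratio calculation and the Pearson coefficient follows. Gene density is assumed here to mean genes per megabase, and the chromosome values are invented for illustration; the t statistic shown is the standard one used to test whether a Pearson coefficient differs from zero (with n - 2 degrees of freedom).

```python
from math import sqrt

def pair_ratio(h2h_gene_count, total_gene_count):
    """Fraction of a chromosome's genes that belong to head-to-head pairs."""
    return h2h_gene_count / total_gene_count

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r != 0 with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# Hypothetical chromosomes: (h2h gene number, total genes, length in Mb).
chromosomes = {
    "chrA": (40, 400, 100.0),
    "chrB": (90, 800, 120.0),
    "chrC": (20, 300, 150.0),
    "chrD": (150, 1200, 60.0),
}
ratios = [pair_ratio(h, total) for h, total, _ in chromosomes.values()]
densities = [total / length for _, total, length in chromosomes.values()]
r = pearson_r(ratios, densities)
t = t_statistic(r, len(ratios))
```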
Distribution of Head-To-Head Gene Pairs on Each Chromosome
Relationship between Head-to-Head Gene Pair Ratio and Gene Density
Phylogenetic Analysis of Head-to-Head Gene Organization in Vertebrate Genomes
Because human, mouse, and rat share a common profile of the distance distribution of head-to-head gene pairs, we asked whether head-to-head gene organization is conserved during vertebrate evolution. The Fugu rubripes, Gallus gallus (chicken), and human genomes were selected for this analysis. Fugu has the smallest known genome (approximately 365 Mb) of any vertebrate species, around one eighth of the size of the human genome [17]. The chicken has a genome of 1.2 Gb, approximately 40% of the size of the human genome, and is the premier nonmammalian vertebrate model organism.
First, we identified orthologous gene pairs that remained consecutive with the same relative orientation in both human and Fugu. To detect orthologous genes in human and Fugu, 37,439 predicted Fugu peptides from the Fugu Genome Project were compared to 33,869 human peptides from Ensembl. According to the filtering criteria described by Aparicio et al. [17], 10,209 human-Fugu orthologous genes were determined. We mapped these genes to the human genome and extracted 4,225 human consecutive pairs. Of these, 760 pairs (18.0%) were found to be consecutive with the same relative orientation in the Fugu genome; these represent gene pairs with conserved linkage between human and Fugu. This proportion is comparable to Dahary et al.'s report [18].
Conservation of Gene Pair Organization between Human and Fugu
We then examined the conservation of head-to-head gene organization. Of the 4,225 human consecutive pairs with orthology in Fugu, 348 show the head-to-head organization, of which 83 (23.9%) keep the same organization in Fugu. We used gene pairs that are consecutive and transcribed from the same strand in human as a control set (denoted “same-strand”). Only 15.2% (285 of 1,875) of the “same-strand” pairs in human have the same organization in Fugu. These data indicate that head-to-head gene pairs maintain their gene order significantly more often than the background (total) and the control (same-strand) (p-value < 5 × 10⁻³, by Fisher's exact test). The probability of rearrangement could depend on the distance between a pair of genes in the ancestral genome [19]. To exclude the possibility that the observed rearrangement differences between head-to-head and “same-strand” pairs are caused by differences in their original distance, we extracted 740 “same-strand” human pairs with an average distance comparable to that of the 348 head-to-head pairs. Still, only 13.7% of these “same-strand” pairs had their gene order and orientation conserved (see Table S4 for detailed information).
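The comparison of head-to-head pairs against the “same-strand” control can be checked with a small sketch of Fisher's exact test on the 2 × 2 table built from the counts above (83 of 348 versus 285 of 1,875 conserved). The text does not state whether the test was one- or two-sided; the one-sided version below is an assumption, and the function name is illustrative.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of a or more in the top-left cell with all margins fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    # Sum hypergeometric numerators as exact integers, divide once at the end.
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, k_max + 1)
    )
    return numerator / comb(n, col1)

# Rows: head-to-head, same-strand; columns: conserved in Fugu, not conserved.
p_fugu = fisher_exact_greater(83, 348 - 83, 285, 1875 - 285)
```

The same function applies to any other 2 × 2 comparison of conserved versus non-conserved pairs.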
The Fugu genome is known to be highly compressed, with very short intergenic regions compared to higher vertebrates [17]. To check whether head-to-head gene organization is conserved strongly enough to constrain the expansion of gene distance, we calculated the genomic distances of gene pairs with human-Fugu linkage in human and in Fugu. Because full-length information is unavailable for the Fugu genes, genomic distance was defined as the absolute value of the distance between protein-coding regions. For the entire group of 760 pairs with human-Fugu linkage, the average distance between a pair of genes in human was 8.90-fold larger than that in Fugu, which is in accordance with the difference in genome size between human and Fugu. The “same-strand” group gives similar results. In contrast, only a 3.81-fold difference was observed for head-to-head gene pairs, with an average distance of 7.6 kb in human and 2.0 kb in Fugu (median, 1.3 kb and 1.6 kb, respectively). These results suggest negative selection against the separation of head-to-head gene pairs, implying the ancestral existence of this gene organization.
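The fold differences quoted above are ratios of summary distances between two species. The sketch below assumes the ratio is taken between group averages (not averaged per pair); the helper name and the toy values are hypothetical. With the rounded averages of 7.6 kb and 2.0 kb the ratio is 3.8, consistent with the reported 3.81-fold up to rounding, whereas the medians (1.3 kb and 1.6 kb) give a ratio below 1.

```python
from statistics import mean, median

def fold_expansion(distances_a, distances_b, summary=mean):
    """Ratio of a summary distance in species A to that in species B."""
    return summary(distances_a) / summary(distances_b)

# Rounded group averages and medians from the text (kb).
mean_fold = 7.6 / 2.0
median_fold = 1.3 / 1.6

# Toy per-pair distances in bp (hypothetical).
toy_fold = fold_expansion([2000, 4000], [1000, 2000])
toy_median_fold = fold_expansion([2000, 4000, 9000], [1000, 2000, 3000], summary=median)
```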
Genomic Distances of Gene Pairs with Human-Fugu Linkage
Furthermore, we analyzed the conservation of head-to-head gene organization between the human and chicken genomes. By comparing 28,416 chicken peptides from Ensembl to 33,869 human peptides, 12,136 human-chicken orthologous genes were identified and mapped to the human and chicken genomes. Then, 5,834 human consecutive pairs with orthology in chicken were extracted; of these, 3,490 pairs (59.8%) have conserved linkage between human and chicken, which is much higher than between human and Fugu (18.0%), reflecting the closer phylogenetic relationship between human and chicken. Of the 5,834 human consecutive pairs, 384 show head-to-head organization, of which 264 (68.8%) keep this organization in chicken. In comparison, only 56.3% (1,491 of 2,646) of the control set, the “same-strand” pairs in human, are consecutive on the same strand in chicken, indicating that head-to-head gene pairs maintain their gene order significantly more often (p-value < 5 × 10⁻³, by Fisher's exact test). For the same reason as above, we analyzed a group of 912 “same-strand” pairs with an average distance comparable to that of the 384 head-to-head pairs and found that 60.5% (552 of 912) of them had their gene order and orientation conserved, which is consistent with the background (59.8%) (see Table S5 for detailed information).
Conservation of Gene Pair Organization between Human and Chicken
We also calculated the genomic distance of each gene pair with human-chicken linkage in both human and chicken. For the entire group of 3,490 pairs, the average distance between genes was 2.89-fold larger in human than in chicken, similar to the “same-strand” group (2.93-fold) and consistent with the difference in genome size between human and chicken. In contrast, only a 1.59-fold difference was observed for head-to-head gene pairs.
Genomic Distances of Gene Pairs with Human-Chicken Linkage
In addition, we calculated the genomic distances of gene pairs with human-chicken-Fugu linkage (Table S6). For the entire group of 325 pairs, the average distance between genes in human was 2.87-fold larger than in chicken and 9.97-fold larger than in Fugu, which is comparable to the differences in genome size among human, chicken, and Fugu. The “same-strand” group again gives similar results. However, the average distance between head-to-head genes in human was only 1.25-fold larger than in chicken and 3.68-fold larger than in Fugu. All of these data suggest that head-to-head gene organization has been conserved during vertebrate evolution and is therefore functionally important.
Average Genomic Distances of Gene Pairs with Human-Chicken-Fugu Linkage
Expression Analysis of Human Head-to-Head Gene Pairs
The existence of a bidirectional promoter or potential shared cis-elements in a head-to-head gene pair raises the question of whether the two genes are transcriptionally coregulated. To investigate the transcription correlation between head-to-head genes, we mapped human head-to-head pairs to three human microarray datasets, E-MEXP-101, E-MEXP-230, and Jurkat (see Table S7 for original data), and obtained expression data for 369, 304, and 308 gene pairs in the three datasets, respectively. We then calculated the Pearson correlation coefficient of all gene pairs in each dataset independently (Table S8, “allH2H” sheet) and drew three distribution plots of the correlation coefficient (Table S9, “allH2H” sheet). Surprisingly, the expression correlations showed bimodal distributions with two peaks corresponding to positive and negative correlations, respectively, which differs from the previous report of a Gaussian distribution slightly shifted in the positive direction [2]. To exclude the possibility that a positive correlation of a gene pair in one experiment cancels out a negative correlation in another experiment, we obtained an average distribution by averaging the three distributions instead of averaging the correlations of each gene pair. The average distribution is still bimodal, with a large positive peak and a small negative peak.
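The distinction between averaging distributions and averaging per-pair correlations can be sketched as follows. The number of bins, the function names, and the toy coefficients are assumptions for illustration. Each dataset's correlation coefficients are first turned into a normalized histogram over [-1, 1], and the histograms are then averaged bin by bin, so that a pair's opposite-sign correlations in two datasets do not cancel.

```python
def correlation_histogram(correlations, n_bins=10):
    """Fraction of pairs in each of n_bins equal-width bins spanning [-1, 1]."""
    counts = [0] * n_bins
    for r in correlations:
        # r == 1.0 would index one past the end, so clamp to the last bin.
        idx = min(int((r + 1) / 2 * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(correlations)
    return [c / total for c in counts]

def average_histogram(histograms):
    """Bin-by-bin mean of several equally binned histograms."""
    return [sum(column) / len(histograms) for column in zip(*histograms)]

# Toy coefficients for two datasets (hypothetical): the same pair is
# positively correlated in one dataset and negatively in the other.
dataset_1 = [0.8, 0.7, -0.6]
dataset_2 = [-0.8, 0.7, -0.6]
averaged = average_histogram(
    [correlation_histogram(d, n_bins=2) for d in (dataset_1, dataset_2)]
)
```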
The Bimodal Distribution of the Expression Correlation between Head-to-Head Genes
We then evaluated the significance of each correlation at p < 0.05 (Table S8, “allH2H” sheet). Among a total of 549 head-to-head pairs with available microarray data, 199 (36.2%) show exclusively significant positive correlations and 94 (17.1%) show exclusively significant negative correlations in at least one microarray dataset. In addition, 49 pairs (8.9%) display positive or negative correlation depending on the conditions of the microarray experiments, indicating that alternative mechanisms may be involved in the transcriptional regulation of some bidirectional promoters. Because some of the 549 pairs have data in only one or two of the three microarray datasets, the real proportion of alternative correlation could be higher than reported here. Overall, at least 62.3% of head-to-head gene pairs show significant expression correlation. The negative correlation and alternative correlation were underestimated by previous studies [2].
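The three-way classification described above can be sketched as a small function. It assumes each pair carries a list of (correlation coefficient, p-value) results, one per dataset in which the pair has data; the function name and category labels are illustrative.

```python
def classify_pair(results, alpha=0.05):
    """Classify a gene pair from its per-dataset (r, p_value) results.

    "positive" / "negative": only significant correlations of that sign;
    "alternative": significant correlations of both signs;
    "not significant": no significant correlation in any dataset."""
    signs = {
        "positive" if r > 0 else "negative"
        for r, p_value in results
        if p_value < alpha
    }
    if signs == {"positive"}:
        return "positive"
    if signs == {"negative"}:
        return "negative"
    if signs == {"positive", "negative"}:
        return "alternative"
    return "not significant"

# Counts reported in the text: 199 positive, 94 negative, 49 alternative of 549.
overall_percent = round(100 * (199 + 94 + 49) / 549, 1)
```

The reported counts sum to 342 of 549 pairs, which reproduces the overall figure of 62.3%.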
Functional Analysis of Human Head-to-Head Gene Pairs
All of the following functional analyses were based on Gene Ontology (GO) [21] annotations for head-to-head genes according to the association information provided by the NCBI Gene Database (ftp://ftp.ncbi.nlm.nih.gov/gene). Of the 2,515 genes involved in the 1,262 human head-to-head pairs, 1,160, 1,019, and 1,075 genes were directly annotated by the “biological process,” “molecular function,” and “cellular component” GO subsystems, respectively (Table S10, “all_DirectAnnotation” sheet). When both genes of a head-to-head pair are annotated by GO, the pair is denoted as an “annotated pair.” Of the 1,262 pairs, we obtained 267, 205, and 318 annotated pairs in the three subsystems, respectively. As mentioned in Materials and Methods, any direct annotation is generalized to all ancestor terms up to the root terms in our analyses, and “annotation” means “general annotation” in what follows.
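Generalizing a direct annotation to all of its ancestor terms amounts to a traversal of the GO graph toward the root. The sketch below uses a tiny hypothetical parent map with made-up term labels, not real GO identifiers or the actual GO structure.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def general_annotation(direct_terms, parents):
    """Direct terms plus every ancestor up to the root."""
    result = set(direct_terms)
    for term in direct_terms:
        result |= ancestors(term, parents)
    return result

# Hypothetical fragment of a term hierarchy (child -> parents).
toy_parents = {
    "nucleosome assembly": ["chromatin assembly"],
    "chromatin assembly": ["DNA packaging"],
    "DNA packaging": ["root"],
    "anion transport": ["root"],
}
toy_general = general_annotation({"nucleosome assembly"}, toy_parents)
```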
To determine whether head-to-head genes statistically tend to perform similar functions, we evaluated functional similarities for annotated head-to-head pairs using the Resnik semantic measure. The distribution of functional similarities for these pairs is significantly shifted toward larger values relative to that for random pairs, confirming the cofunction tendency observed in individual experiments. The p-values by the Kolmogorov-Smirnov test are 0.0085 for “biological process,” 0.0126 for “molecular function,” and 4.2 × 10⁻⁹ for “cellular component”; the much smaller p-value for “cellular component” indicates that head-to-head gene products are more likely to act in the same cellular component than to share the other two kinds of annotation.
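The Resnik measure scores two terms by the information content, -log p(c), of their most informative common ancestor c, where p(c) is the fraction of genes annotated with c. A minimal gene-level sketch follows; it assumes each gene's annotation set has already been generalized to all ancestors, so the common ancestors are simply the shared terms. The gene names, term labels, and the choice of taking the maximum over shared terms are illustrative assumptions and may differ from the procedure in Materials and Methods.

```python
from math import log

def information_content(annotations):
    """-log of the fraction of genes carrying each term."""
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -log(count / n_genes) for term, count in counts.items()}

def resnik_similarity(gene_a, gene_b, annotations, ic):
    """Information content of the most informative shared term."""
    shared = annotations[gene_a] & annotations[gene_b]
    return max((ic[term] for term in shared), default=0.0)

# Hypothetical general annotations (already closed under ancestors).
toy_annotations = {
    "g1": {"root", "A", "A1"},
    "g2": {"root", "A", "A1"},
    "g3": {"root", "A"},
    "g4": {"root", "B"},
}
toy_ic = information_content(toy_annotations)
sim_close = resnik_similarity("g1", "g2", toy_annotations, toy_ic)
sim_far = resnik_similarity("g1", "g4", toy_annotations, toy_ic)
```

Genes sharing only the root term score zero, because a term carried by every gene has no information content.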
The Distribution of Functional Similarities for Head-to-Head Gene Pairs
We then set out to find the GO terms that represent cofunctions of head-to-head pairs, that is, the functions whose associated genes tend to be organized in the head-to-head manner. Using the binomial probability model described in Materials and Methods, we obtained 22, eight, and 15 significant cofunctions in the “biological process,” “molecular function,” and “cellular component” subsystems, respectively, at a significance level of 0.01 (already adjusted for multiple testing with the Bonferroni method). By merging terms that point to closely related functions (see the figures in the latter three sheets of Table S10 for the relationships of the cofunctions in each GO subsystem), we propose that genes involved in functions including metabolism, chromosome organization and DNA packaging, anion transport, nucleic acid binding, catalytic activity, intracellular and organelle components, protein complex, and collagen type IV, among others, are more likely to be organized in the head-to-head configuration.
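The general shape of a binomial test with Bonferroni adjustment can be sketched as below. The actual model is specified in Materials and Methods and is not reproduced here; in particular, how the per-term success probability is derived is not shown, and the function names and p-values are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    pairs sharing a term among n annotated pairs if each does so with
    probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bonferroni_significant(p_values, alpha=0.01):
    """Keep terms whose p-value, multiplied by the number of tests,
    stays below alpha."""
    n_tests = len(p_values)
    return {term: pv for term, pv in p_values.items() if pv * n_tests < alpha}

# Hypothetical raw p-values for two terms.
toy_p_values = {"term_a": 0.001, "term_b": 0.006}
toy_significant = bonferroni_significant(toy_p_values)
```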
Significant Cofunctions Associated with Head-To-Head Gene Pairs in the “Biological Process” (BP), “Molecular Function” (MF), and “Cellular Component” (CC) Subsystems
To check the expression correlation between head-to-head genes coding for similar functions, we extracted the expression correlation coefficients of the 282 pairs associated with the above 45 significant cofunctions (see Table S8, “cofunctionH2H” sheet, for details of the expression correlation analysis; see the latter three sheets of Table S10 for the association between cofunctions and gene pairs). The expression correlation of head-to-head genes with cofunction is still characterized by bimodal distributions similar to that observed for all head-to-head pairs (Table S9, “cofunctionH2H” sheet). According to the Pearson correlation test, 80 (36.7%) and 45 (20.6%) of the 218 pairs with available microarray data show significant positive and negative expression correlations, respectively, and 30 pairs (13.8%) display positive or negative correlation depending on the conditions of the microarray experiments. Overall, 71.1% of the cofunction pairs are significantly correlated, which is somewhat higher than the 62.3% of background head-to-head pairs. Notably, the proportion of the third type, alternative correlation (13.8%), is higher than that for the background (8.9%). These data suggest that head-to-head genes coding for similar functions have stronger expression correlation, especially alternative correlation.
Here we focused on more specific GO terms rather than on terms with limited information content such as “metabolism,” even though the latter might have very small p-values. Five DNA packaging-related terms, “nucleosome assembly,” “chromatin assembly or disassembly,” “establishment and/or maintenance of chromatin architecture,” “DNA packaging,” and “chromosome organization and biogenesis (sensu Eukaryota),” ranked near the top of the list of “biological process” terms sorted by ascending p-value. Also, the terms “nucleosome,” “chromatin,” and “chromosome” in the “cellular component” subsystem represent different aspects of similar functions. All of these terms coherently point to the following five head-to-head gene pairs, HIST1H2BN which are all histone coding genes (the first five entries in the table of histone-coding pairs). Apart from these pairs, we also found 11 more histone-coding head-to-head pairs (the other 11 pairs in that table) in Table S1 according to the gene names and summaries provided by the NCBI Gene Database. These were not covered by the cofunction list (the latter three sheets of Table S10) because at least one member of each pair has not yet been annotated by the GO system. Taken together, the 16 pairs involve a total of 31 genes since HIST1H2BF could form two head-to-head pairs with overlapping genes HIST1H2AD respectively. The 31 involved genes account for 37% of the 83 genes located in the histone clusters. All 16 pairs are organized in a nonoverlapping head-to-head manner, and most of them have very similar TSS distances. However, among the eight pairs with available microarray data, only one pair, HIST1H2AC/HIST1H2BC, shows positive expression correlations at p < 0.05. We cannot exclude the possibility that the other pairs have expression correlation under other experimental conditions.
The Human Head-to-Head Gene Pairs Coding for Histone
We noticed four collagen-related significant terms in the “cellular component” subsystem, “collagen type IV,” “sheet-forming collagen,” “collagen,” and “basement membrane,” which coherently point to three head-to-head gene pairs, COL4A2/COL4A1, COL4A3/COL4A4, and COL4A5/COL4A6. They are all of the nonoverlapping type and are located on Chromosomes 13, 2, and X, respectively. These three pairs were also annotated by several other significant cofunctions in the other two subsystems, such as “extracellular matrix structural constituent” in the “molecular function” subsystem and “inorganic anion transport,” “anion transport,” and “phosphate transport” in the “biological process” subsystem. Interestingly, the COL4A2/COL4A1 pair and the COL4A5/COL4A6 pair display significant positive expression correlations at p < 0.05; in contrast, COL4A3/COL4A4 displays a negative correlation.