Orthologous gene pairs identify ancestral patterns of gene regulation
Human protein-coding genes regulated by bidirectional promoters were placed into 821 pairs and mapped to other species using conserved synteny information, following the approach outlined in Figure 1. Three types of outcomes were detected in the species being compared to human: (I) the orthologous bidirectional gene pair was present in the second species; (II) only one member of the gene pair was present in the second species; and (III) no evidence existed for a bidirectional promoter in the second species. Comparisons were made between human and chimp, rhesus, dog, mouse, chicken, Fugu, and zebrafish.
Figure 1 Flow diagram for mapping orthologous bidirectional promoters. The initial stage identifies orthologous regions between humans and other species. This stage is further refined by defining whether these regions align to non-gapped regions (as nearly perfect (more ...)
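The three mapping outcomes can be sketched in code. The following is a minimal illustration only, assuming a hypothetical lookup table from human gene names to ortholog identifiers in one comparison species; the names `Outcome`, `classify_pair`, and the gene labels are invented for the example, and the sketch does not perform the synteny or alignment checks of the actual pipeline.

```python
from enum import Enum
from typing import Mapping, Optional, Tuple


class Outcome(Enum):
    PAIR_PRESENT = "I"    # both members of the pair have an ortholog
    ONE_PRESENT = "II"    # only one member of the pair has an ortholog
    ABSENT = "III"        # neither member has an ortholog


def classify_pair(
    human_pair: Tuple[str, str],
    orthologs: Mapping[str, Optional[str]],
) -> Outcome:
    """Classify one human gene pair by how many members have an ortholog."""
    found = sum(orthologs.get(gene) is not None for gene in human_pair)
    if found == 2:
        return Outcome.PAIR_PRESENT
    if found == 1:
        return Outcome.ONE_PRESENT
    return Outcome.ABSENT


# Hypothetical ortholog table for a single comparison species.
orthologs = {
    "GENE_A": "ortho_a",
    "GENE_B": "ortho_b",
    "GENE_C": "ortho_c",
    "GENE_D": None,
}
pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
outcomes = [classify_pair(pair, orthologs) for pair in pairs]
```

In this simplified form, outcome (I) only records that both orthologs exist; confirming that they still share a bidirectional promoter would additionally require their strands and positions.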
Figure 2 illustrates the evolutionary history of bidirectional promoters in vertebrates. For instance, 60 pairs of human genes had orthologs in all seven species. Another set of genes had orthologs in mammals and fish, but not birds, suggesting lineage-specific loss. Other gene pairs were missing only in Fugu or only in zebrafish, suggesting missing annotations in one fish or the other. One other set of genes was absent from dog annotations but present in primates and mouse, suggesting that the dog genome annotation is incomplete for these genes.
Figure 2 Mapping the evolutionary history of bidirectional promoters. Human bidirectional promoters were mapped by their surrounding orthologous genes. Examples marked by a red bar correspond to orthologous bidirectional promoters, where both human genes are present (more ...)
Other sets of bidirectional promoters showed a lineage-specific history. For instance, a large group of mammal-specific genes was not present in chickens or fish. A smaller group was only present in primates. In contrast, genes that were present in all species except chimp were likely to be missing from chimp due to assembly problems. Nearly twenty pairs of genes were found only in the human genome.
Intergenic distance at bidirectional promoters
The distance between TSSs at bidirectional promoters was mapped for human and other vertebrate species. Each species is shown in two graphs. One graph depicts the raw distance measurements between the TSSs in human and in the second species (Figure 3). The distance measurements are graphed with human on the x-axis and the second species on the y-axis. The scatter plots indicate the size of the datasets and the correlation of the bidirectional promoter lengths at orthologous gene pairs of eight species. The red line shows the position of a linear relationship (x = y), where the distances between the TSSs are the same for the two species.
Figure 3 Distance mapping between orthologous bidirectional promoters. Each species is compared to the human dataset in two graphs. The left graph plots the distance between transcription start sites for human and the second species at orthologous bidirectional (more ...)
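As an illustration of the distance measurement, the sketch below computes the separation between two transcription start sites from gene coordinates and strand. It assumes a simple annotation record in which the TSS is the leftmost coordinate for a plus-strand gene and the rightmost coordinate for a minus-strand gene; the `Gene` class, the `tss_distance` function, and the coordinates are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int   # leftmost coordinate of the gene
    end: int     # rightmost coordinate of the gene
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        # The TSS is the 5' end of the gene, which depends on the strand.
        return self.start if self.strand == "+" else self.end


def tss_distance(a: Gene, b: Gene) -> int:
    """Distance in bp between the transcription start sites of two genes."""
    if a.chrom != b.chrom:
        raise ValueError("genes are on different chromosomes")
    return abs(a.tss - b.tss)


# Hypothetical divergent pair: minus-strand gene on the left, plus-strand on the right.
human_dist = tss_distance(
    Gene("chr1", 1000, 5000, "-"), Gene("chr1", 5400, 9000, "+")
)
other_dist = tss_distance(
    Gene("chr4", 20000, 26000, "-"), Gene("chr4", 26400, 31000, "+")
)
on_diagonal = human_dist == other_dist  # the pair falls on the x = y line
```

Each orthologous pair would contribute one point (human distance, second-species distance) to a scatter plot of this kind.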
The second graph shows the cumulative percentage of bidirectional promoters mapped in human and a second species, where the human dataset is limited to a 1000 bp distance. The most complete annotations were found in the human-mouse comparison. This result is illustrated by the similar curves for the cumulative percentage of orthologous bidirectional promoters in mouse that fall within 1000 bp. Up to 80% of all human bidirectional promoters were identified in mouse at this similar distance. In comparison, 75% of the human promoters were present in chimp within 1000 bp. The high levels of orthology found in mouse and chimp suggest that the 1000 bp distance will capture similar gene sets in other species. Thus, we predict that the gene annotations of chimp, rhesus and dog will improve to represent a minimum of 80% of the bidirectional promoters in the human genome.
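A cumulative percentage curve of this kind can be computed as sketched below. This is an assumption-laden illustration rather than the procedure used for the figures: `cumulative_percentage` is a hypothetical helper, the distances are toy values, and percentages are expressed relative to the number of human promoters so that a species with missing orthologs plateaus below 100%.

```python
from bisect import bisect_right
from typing import List, Sequence


def cumulative_percentage(
    distances: Sequence[int],
    thresholds: Sequence[int],
    total: int,
) -> List[float]:
    """Percentage of `total` promoters whose TSS distance is <= each threshold."""
    ordered = sorted(distances)
    # bisect_right counts how many sorted distances are <= the threshold.
    return [100.0 * bisect_right(ordered, t) / total for t in thresholds]


# Toy data: five human promoters, four of which were mapped in a second species.
human = [100, 250, 400, 900, 1000]
other = [120, 300, 950, 1500]
thresholds = [250, 500, 1000]

human_curve = cumulative_percentage(human, thresholds, total=len(human))
other_curve = cumulative_percentage(other, thresholds, total=len(human))
```

With these toy values the human curve reaches 100% at 1000 bp, whereas the second species reaches only 60%, because one pair is unmapped and another lies beyond 1000 bp.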
Evolutionary comparison of head-to-head and tail-to-tail gene pairs
The percentage of human bidirectional promoters detected at distances up to 1000 bp was compared to the cumulative percentage detected in other species (Figure 4). Evidence of selective pressure was determined from the retention of human tail-to-tail genes, spaced within the 1000 bp limit, in other species. Pairs of genes representing bidirectional promoters are shown in green and tail-to-tail genes in purple. The same color scheme was used for the second species, but with different symbols. Although the total percentage of genes mapped in the second species was less than 100% for chimp, rhesus, and dog, the head-to-head and tail-to-tail gene sets were recovered in equivalent proportions at the 1000 bp distance. In these datasets the tail-to-tail genes plateau at a longer intergenic distance than the head-to-head genes. Thus, a larger distance between the orthologous tail-to-tail genes has been tolerated without deleterious effects.
Figure 4 Comparison of head-to-head and tail-to-tail gene pairs identified at orthologous positions. Bidirectional promoter data are graphed in green, with dots representing human and plus signs representing the other species. Tail-to-tail gene pairs are represented (more ...)
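The distinction between head-to-head and tail-to-tail arrangements can be made explicit from strand and position alone. The sketch below is a simplified illustration, assuming non-overlapping neighbouring genes on the same chromosome and a 1000 bp limit on the gap between them; `classify_adjacent` and the coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def classify_adjacent(a: Gene, b: Gene, max_gap: int = 1000) -> Optional[str]:
    """Return 'head-to-head', 'tail-to-tail', or None for two neighbouring genes."""
    left, right = sorted((a, b), key=lambda g: g.start)
    gap = right.start - left.end
    if gap < 0 or gap > max_gap:
        return None  # overlapping, or too far apart for this sketch
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"  # 5' ends adjacent: divergent transcription
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"  # 3' ends adjacent: convergent transcription
    return None  # same-strand (tandem) neighbours


divergent = classify_adjacent(Gene(1000, 5000, "-"), Gene(5400, 9000, "+"))
convergent = classify_adjacent(Gene(1000, 5000, "+"), Gene(5400, 9000, "-"))
too_far = classify_adjacent(Gene(1000, 5000, "-"), Gene(7000, 9000, "+"))
```

For a head-to-head pair the gap equals the distance between the two TSSs; for a tail-to-tail pair it is the distance between the two 3' ends.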
For chicken datasets the head-to-head gene sets were found more frequently than the tail-to-tail sets at 1000 bp, indicating that tail-to-tail arrangements of genes had been allowed to change in both distance and arrangement more often than head-to-head genes. These results indicate that selective pressure acts more strongly over evolutionary time to keep head-to-head genes together at the 1000 bp distance compared to tail-to-tail genes.
The data from the fish genomes indicated that very long distances were necessary to capture a majority of the human gene pairs. Given the compact nature of the fish genomes, it is unlikely that many of these long distance associations are biologically relevant. However, the preservation of tightly associated genes indicated the presence of important regulatory or functional roles that cannot be disturbed.
Gene ontology associated with bidirectional promoter regulation
Functions associated with orthologous genes regulated by bidirectional promoters were examined for those conserved in all seven species, or in the four mammals. Sixty pairs of genes were conserved across all seven species. These genes were examined for functional classifications. Four groups emerged: intracellular membrane bound organelle, macromolecule metabolism, chaperone, and mitochondrion. The p-values on these groups ranged from 10E-3 to 10E-1, and remained statistically significant following Benjamini correction for false discovery rate (i.e. ~2.7E-1).
Genes that were conserved across the four mammals had a much larger range of functional activities. Of 342 pairs of genes, catalytic activity emerged as the most significant enrichment in any functional class (6.1E-4 after Benjamini correction). Thus bidirectional promoters are regulating many enzymes in mammalian genomes. In total, 58 functional classes were significantly enriched in this dataset compared to a random collection of genes. These data indicate that the regulatory domain of bidirectional promoters has expanded to encompass a much larger set of gene functions in mammals.
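For readers unfamiliar with the Benjamini correction referred to above, the following sketch shows the standard Benjamini-Hochberg adjustment of a list of p-values. It is an assumption that this is the variant applied by the enrichment tool; the function name and the input p-values are illustrative and are not the values from this analysis.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values for four hypothetical functional classes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A functional class would then be called enriched when its adjusted value falls below the chosen false discovery rate threshold.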
Training ESPERR to discriminate bidirectional promoters
Our previous work indicated that sequence-based characteristics were different in bidirectional promoters and non-bidirectional promoters [7]. However the size of the datasets was quite disparate (1,005 bidirectional, 17,613 non-bidirectional). Therefore for training ESPERR [8] we sampled equal size subsets of 800 elements from each class (keeping the remaining elements in each class as test sets for verification). For each training interval we then extracted genomic alignments of six species (human, chimpanzee, macaque, mouse, rat, and dog) from the 17 species alignments available in the UCSC Genome Browser. Regions of the training data overlapping coding exons (from UCSC Known Genes) were masked out.
We first performed an unsupervised encoding selection (the first stage of the ESPERR procedure) to create an encoding in 10 symbols. Leave-one-out cross validation on the training data using this encoding yielded a success rate of 76%. On the bidirectional test set, the model trained using this mapping correctly classified 404 elements (89%) and incorrectly classified 50 (17 elements were not included due to insufficient alignment). On the non-bidirectional test set it successfully classified 11,150 elements (70%), and incorrectly classified 4,845 (687 elements not included).
Next, we performed the full ESPERR procedure, using the first stage reduction to produce an encoding of size 75, which was then refined with the heuristic search yielding an encoding of size 10. The resulting encoding gave a modest improvement in cross-validation, with a success rate of 82%. However, on the bidirectional test set, the model using this encoding classified 405 elements (89%) and incorrectly classified 49 (17 elements were again not included due to insufficient alignment). On the non-bidirectional test set it successfully classified 10,900 elements (68%), and incorrectly classified 5,095 (687 elements not included). Thus, using the ESPERR heuristic search gives no improvement for classifying this dataset. It is noteworthy that these classification rates, though modest, indicate that there are sequence and evolutionary patterns that can be captured to characterize bidirectional promoters.
Particularly interesting is the substantially greater generalization rate for the bidirectional test set, suggesting that there are more characteristic signals for these elements that can be captured. This is consistent with the result of the ESPERR heuristic search – optimizing the encoding using the training data gives a slight improvement in recognizing the bidirectional test elements, but at the cost of poorer performance on the non-bidirectional test set.
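The balanced sampling and the reported success rates can be illustrated with a short sketch. The helper names `split_train_test` and `success_rate` are hypothetical and are not part of ESPERR; the sketch assumes that each percentage is computed over classified elements only, excluding those dropped for insufficient alignment, which reproduces the rounded percentages given in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], n_train: int, rng: random.Random
) -> Tuple[List[T], List[T]]:
    """Randomly draw n_train items for training; the remainder form the test set."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def success_rate(correct: int, incorrect: int) -> float:
    """Percentage of classified elements that were classified correctly."""
    return 100.0 * correct / (correct + incorrect)


# Toy example of drawing an equal-size training subset from one class.
train, test = split_train_test(range(10), n_train=8, rng=random.Random(0))

# Counts reported in the text (elements lacking alignment are excluded).
first_stage_bidir = success_rate(404, 50)
first_stage_nonbidir = success_rate(11150, 4845)
full_bidir = success_rate(405, 49)
full_nonbidir = success_rate(10900, 5095)
```

Applying the same split independently to each class with the same `n_train` yields a balanced training set regardless of how disparate the class sizes are.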