Bidirectional promoters are shared regulatory regions that influence the expression of two oppositely oriented genes. This type of regulatory architecture is found more frequently than expected by chance in the human genome, yet many specifics underlying the regulatory design are unknown. Given that the function of most orthologous genes is similar across species, we hypothesized that the architecture and regulation of bidirectional promoters might also be similar across species, representing a core regulatory structure and enabling annotation of these regions in additional mammalian genomes.
By mapping the intergenic distances of genes in human, chimpanzee, bovine, murine, and rat, we show an enrichment for pairs of genes equal to or less than 1,000 bp between their adjacent 5' ends ("head-to-head") compared to pairs of genes that fall in the same orientation ("head-to-tail") or whose 3' ends are side-by-side ("tail-to-tail"). A representative set of 1,369 human bidirectional promoters was mapped to orthologous sequences in other mammals. We confirmed predictions for 5' UTRs in nine of ten manual picks in bovine based on comparison to the orthologous human promoter set and in six of seven predictions in human based on comparison to the bovine dataset. The two predictions that did not have orthology as bidirectional promoters in the other species resulted from unique events that initiated transcription in the opposite direction in only those species. We found evidence supporting the independent emergence of bidirectional promoters from the family of five RecQ helicase genes, which gained their bidirectional promoters and partner genes independently rather than through a duplication process. Furthermore, by expanding our comparisons from pairwise to multispecies analyses we developed a map representing a core set of bidirectional promoters in mammals.
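The orientation classes and the 1,000 bp threshold described above can be illustrated with a minimal sketch. The `Gene` record, the function names, and the 1-based inclusive coordinate convention are assumptions made for illustration; they are not taken from the study's pipeline.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate (assumed 1-based, inclusive)
    end: int     # rightmost coordinate (assumed 1-based, inclusive)
    strand: str  # "+" or "-"


def classify_pair(left: Gene, right: Gene) -> str:
    """Classify two adjacent genes, where `left` lies upstream on the + strand."""
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"   # 5' ends are adjacent (divergent)
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"   # 3' ends are adjacent (convergent)
    return "head-to-tail"       # both genes on the same strand


def intergenic_distance(left: Gene, right: Gene) -> int:
    """Number of bases between the two genes; overlapping genes give 0."""
    return max(0, right.start - left.end - 1)


def is_bidirectional_candidate(left: Gene, right: Gene, max_gap: int = 1000) -> bool:
    """True for head-to-head pairs whose 5' ends are at most `max_gap` bp apart."""
    return (classify_pair(left, right) == "head-to-head"
            and intergenic_distance(left, right) <= max_gap)


if __name__ == "__main__":
    a = Gene("geneA", 100, 500, "-")
    b = Gene("geneB", 900, 2000, "+")
    print(classify_pair(a, b), intergenic_distance(a, b),
          is_bidirectional_candidate(a, b))
```

For a head-to-head pair the 5' end of the left gene is its rightmost coordinate and the 5' end of the right gene is its leftmost coordinate, so the gap between the gene bodies is also the distance between the 5' ends.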
We show that the orthologous positions of bidirectional promoters provide a reliable guide to directly annotate over one thousand regulatory regions in sequences of mammalian genomes, while also serving as a useful tool to predict 5' UTR positions and identify genes that are novel to a single species.
Current intergene distance is shown to be consistently the strongest predictor of synteny conservation as expected under a simple null model, and other variables are of lesser importance.
Why do some groups of physically linked genes stay linked over long evolutionary periods? Although several factors are associated with the formation of gene clusters in eukaryotic genomes, the particular contribution of each feature to clustering maintenance remains unclear.
We quantify the strength of the proposed factors in a yeast lineage. First we identify the magnitude of each variable to determine linkage conservation by using several comparator species at different distances to Saccharomyces cerevisiae. For adjacent gene pairs, in line with null simulations, intergenic distance acts as the strongest covariate. Which of the other covariates appear important depends on the comparator, although high co-expression is related to synteny conservation commonly, especially in the more distant comparisons, these being expected to reveal strong but relatively rare selection. We also analyze those pairs that are immediate neighbors through all the lineages considered. Current intergene distance is again the best predictor, followed by the local density of essential genes and co-regulation, with co-expression and recombination rate being the weakest predictors. The genome duplication seen in yeast leaves some mark on linkage conservation, as adjacent pairs resolved as single copy in all post-whole genome duplication species are more often found as adjacent in pre-duplication species.
Current intergene distance is consistently the strongest predictor of synteny conservation as expected under a simple null model. Other variables are of lesser importance and their relevance depends both on the species comparison in question and the fate of the duplicates following genome duplication.
Many mammalian genes are organized as bidirectional (head-to-head) gene pairs in which the two genes are separated by less than 1 kb. The transcriptional regulation of these bidirectional gene pairs remains largely unclear, but a few studies have suggested that the two closely adjacent genes in divergent orientation can be co-regulated by a single transcription factor binding to a specific regulatory fragment. Here we report an evolutionarily conserved bidirectional gene pair, known as the PREPL-C2ORF34 gene pair, whose transcription relies on the synergistic cooperation of two transcription factors binding to an intergenic bidirectional minimal promoter.
While PREPL is expressed primarily in brain and heart, C2ORF34 is ubiquitously and abundantly expressed in almost all tissues. Genomic analyses revealed that these two non-homologous genes are adjacent in a head-to-head configuration on human chromosome 2p21 and separated by only 405 bp. Within this short intergenic region, a 243-bp GC-rich segment was demonstrated to function as a bidirectional minimal promoter to initiate the transcription of both flanking genes. Two key transcription factors, NRF-2 and YY-1, were further identified to coordinately participate in driving the expression of both genes in an additive manner. The functional cooperation between these two transcription factors, along with their genomic binding sites and some cis-acting repressive elements, is essential for the transcriptional activation and tissue distribution of the PREPL-C2ORF34 bidirectional gene pair.
This study provides new insights into the complex transcriptional mechanism of a mammalian head-to-head gene pair which requires cooperative binding of multiple transcription factors to a bidirectional minimal promoter of the shared intergenic region.
Gene order and content differ among homologous regions of closely related genomes. Similarities in the expression profiles of physically adjacent genes suggest that the proper functioning of these genes depends on maintaining a specific position relative to each other. To better understand the results of the interaction of these two genomic forces, convergent, divergent, and tandem gene pairs in rice and sorghum, as well as their homologs in rice, sorghum, maize, and Brachypodium, were analyzed. For each pair, we determined its status in all four species: conserved, inverted, rearranged, or missing homologs. We observed that divergent gene pairs had lower rates of conservation than convergent or tandem pairs, but higher rates of rearranged pairs and missing homologs in maize than in any other species. We also discovered species-specific gene pairs in rice and sorghum. In rice, gene pairs with strongly correlated expression levels were conserved significantly more often than those with little or no correlation. We assigned the three types of gene pair to one of 14 possible evolutionary history categories to uncover their evolutionary dynamics during the evolution of grass genomes.
gene pairs; grass genomes; conservation; rearrangement; coexpression
Divergently-paired genes (DPGs) are defined as two adjacent genes that are transcribed in opposite directions (from different DNA strands) and whose transcription start sites (TSSs) are less than 1,000 base pairs apart. DPGs are products of a common organizational feature among eukaryotic genes that has yet to be surveyed across divergent genomes over well-defined evolutionary distances. Mutations in the sequence between a pair of DPGs may result in alterations in shared promoters and thus affect the function of both genes. By sharing promoters, the gene pairs gain the advantage of co-regulation, albeit while bearing a doubled mutational burden in maintaining their normal functions.
Drosophila melanogaster has a significant fraction (31.6% of all genes) of DPGs which are remarkably conserved relative to its gene density as compared to other eukaryotes. Our survey and comparative analysis revealed different evolutionary patterns among DPGs between insect and vertebrate lineages. The conservation of DPGs in D. melanogaster is of significance as they are mostly housekeeping genes characterized by the absence of a TATA box in their promoter sequences. The combination of Initiator and Downstream Promoter Element may play an important role in regulating DPGs in D. melanogaster, providing an excellent niche for studying the molecular details of transcriptional regulation.
DPGs appear to have arisen independently among different evolutionary lineages, such as the insect and vertebrate lineages, and exhibit variable degrees of conservation. Such architectural organizations, including convergently-paired genes (CPGs), may be associated with transcriptional regulation and have significant functional relevance.
An important step in understanding the regulation of a prokaryotic genome is the generation of its transcription unit map. The current strongest operon predictor depends on the distributions of intergenic distances (IGD) separating adjacent genes within and between operons. Unfortunately, experimental data on these distance distributions are limited to Escherichia coli and Bacillus subtilis. We suggest a new graph algorithmic approach based on comparative genomics to identify clusters of conserved genes independent of IGD and conservation of gene order. As a consequence, distance distributions of operon pairs for any arbitrary prokaryotic genome can be inferred. For E. coli, the algorithm predicts 854 conserved adjacent pairs with a precision of 85%. The IGD distribution for these pairs is virtually identical to the E. coli operon pair distribution. Statistical analysis of the predicted pair IGD distribution allows estimation of a genome-specific operon IGD cut-off, obviating the requirement for a training set in IGD-based operon prediction. We apply the method to a representative set of eight genomes, and show that these genome-specific IGD distributions differ considerably from each other and from the distribution in E. coli.
Neighboring genes in the eukaryotic genome have a tendency to express concurrently, and the proximity of two adjacent genes is often considered a possible explanation for their co-expression behavior. However, the actual contribution of the physical distance between two genes to their co-expression behavior has yet to be defined. To further investigate this issue, we studied the co-expression of neighboring genes in zebrafish, which has a compact genome and has experienced a whole genome duplication event. Our analysis shows that the proportion of highly co-expressed neighboring pairs (Pearson’s correlation coefficient R>0.7) is low (0.24% ~ 0.67%); however, it is still significantly higher than that of random pairs. In particular, the statistical result implies that the co-expression tendency of neighboring pairs is negatively correlated with their physical distance. Our findings therefore suggest that physical distance may play an important role in the co-expression of neighboring genes. Possible mechanisms related to the neighboring genes’ co-expression are also discussed.
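The proportion of highly co-expressed neighboring pairs (Pearson's R > 0.7) can be computed along the following lines. This is a minimal sketch: the dictionary-of-profiles input format and the function names are assumptions for illustration, and the expression values in the usage example are toy numbers, not zebrafish data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation; treat as 0
    return sxy / math.sqrt(sxx * syy)


def high_coexpression_fraction(
    profiles: Dict[str, List[float]],
    pairs: List[Tuple[str, str]],
    threshold: float = 0.7,
) -> float:
    """Fraction of gene pairs whose correlation exceeds `threshold`."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for a, b in pairs if pearson(profiles[a], profiles[b]) > threshold
    )
    return hits / len(pairs)


if __name__ == "__main__":
    toy = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
    print(high_coexpression_fraction(toy, [("g1", "g2"), ("g2", "g3")]))
```

Running the same function on randomly drawn gene pairs gives the baseline against which the neighboring-pair fraction is compared.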
gene expression; co-expression; neighboring genes; promoter; zebrafish
We have analyzed a sequence of approximately 70 base pairs (bp) that shows a high degree of similarity to sequences present in the non-coding regions of a number of human and other mammalian genes. The sequence was discovered in a fragment of human genomic DNA adjacent to an integrated hepatitis B virus genome in cells derived from human hepatocellular carcinoma tissue. When one of the viral flanking sequences was compared to nucleotide sequences in GenBank, more than thirty human genes were identified that contained a similar sequence in their non-coding regions. The sequence element was usually found once or twice in a gene, either in an intron or in the 5' or 3' flanking regions. It did not share any similarities with known short interspersed nucleotide elements (SINEs) or presently known gene regulatory elements. This element was highly conserved at the same position within the corresponding human and mouse genes for myoglobin and N-myc, indicating evolutionary conservation and possible functional importance. Preliminary DNase I footprinting data suggested that the element or its adjacent sequences may bind nuclear factors to generate specific DNase I hypersensitive sites. The size, structure, and evolutionary conservation of this sequence indicate that it is distinct from other types of short interspersed repetitive elements. It is possible that the element may have a cis-acting functional role in the genome.
Given genetic networks derived from two genomes, it may be difficult to decide if their local structures are similar enough in both genomes to infer some ancestral configuration or some conserved functional relationships. Current methods all depend on searching for identical substructures.
We explore a generalized vertex proximity criterion, and present analytic and probability results for the comparison of random lattice networks.
We apply this criterion to the comparison of the genetic networks of two evolutionarily divergent yeasts, Saccharomyces cerevisiae and Schizosaccharomyces pombe, derived using the Synthetic Genetic Array screen. We show that the overlapping parts of the networks of the two yeasts share a common structure beyond the shared edges. This may be due to their conservation of redundant pathways containing many synthetic lethal pairs of genes.
Detecting the shared generalized adjacency clusters in the genetic networks of the two yeasts shows that this analytical construct can be a useful tool in probing conserved network structure across divergent genomes.
Ty transposable-element insertion mutations of Saccharomyces cerevisiae can cause cell-type-dependent activation of adjacent-gene expression. Several cis-acting regulatory regions within Ty1 are responsible for the effect of Ty1 on adjacent-gene expression. One of these is the block II sequence that was defined by its homology to mammalian enhancers and to the yeast a1-alpha 2 control site. Tandem copies of a 57-base-pair region encompassing block II caused an additive increase in expression of the CYC7 reporter gene in the absence of other Ty1 sequences. The activation of gene expression by the multiple repeats was abolished in a/alpha diploid cells. A specific complex between a constitutive factor in whole-cell extracts and the DNA regulatory element was observed. The protein-binding site for the constitutive factor coincided with the block II element. Base-pair substitutions within the binding site abolished the ability of the block II element to function as a component of the Ty1 activator and to form the factor-DNA complex. The correlation between complex formation and reporter gene expression indicates that factor binding to the cis-acting element is essential for this element to function as a component of the Ty1 activator.
The genetic relationship between the retrovirus-like intracisternal type A particle (IAP) from Mus musculus and the novel retrovirus (M432) from M. cervicolor has been determined by heteroduplex and restriction endonuclease analyses of molecular clones of the respective genomes. We have found a major homology region (3.7 kilobase pairs) which probably begins near the 3' end of the M432 gag gene, spans the pol gene, and ends in the env gene. A second region (0.6 kilobase pairs) of weak homology was also observed adjacent to the 3' long terminal repeats of the respective genomes. The IAP genome is well conserved in the cellular DNA of all species of the genus Mus. In contrast, cellular DNA sequences related to the 5' end of the M432 genome, which shares no homology with the IAP genome, are found only in M. cervicolor and the closely related species M. cookii. These results suggest that the infectious M432 retroviral genome arose as a result of a recombinational event(s) between the IAP genome and another, as yet unidentified, class of retrovirus-related sequences or other cellular sequences.
The molecular evolution of cis-regulatory sequences is not well understood. Comparisons of closely related species show that cis-regulatory sequences contain a large number of sites constrained by purifying selection. In contrast, there are a number of examples from distantly related species where cis-regulatory sequences retain little to no sequence similarity but drive similar patterns of gene expression. Binding site turnover, whereby the gain of a redundant binding site enables loss of a previously functional site, is one model by which cis-regulatory sequences can diverge without a concurrent change in function. To determine whether cis-regulatory sequence divergence is consistent with binding site turnover, we examined binding site evolution within orthologous intergenic sequences from 14 yeast species defined by their syntenic relationships with adjacent coding sequences. Both local and global alignments show that nearly all distantly related orthologous cis-regulatory sequences have no significant level of sequence similarity but are enriched for experimentally identified binding sites. Yet, a significant proportion of experimentally identified binding sites that are conserved in closely related species are absent in distantly related species and so cannot be explained by binding site turnover. Depletion of binding sites depends on the transcription factor but is detectable for a quarter of all transcription factors examined. Our results imply that binding site turnover is not a sufficient explanation for cis-regulatory sequence evolution.
evolution; regulation; yeast
Gene duplication and subsequent functional divergence, especially expression divergence, have been widely considered main sources of evolutionary innovation. Many studies have shown that genetic regulatory networks evolve rapidly shortly after gene duplication, leading to accelerated expression divergence and diversification. However, little is known about whether epigenetic factors have mediated the evolution of expression regulation since gene duplication. In this study, we conducted detailed analyses of yeast histone modification (HM), the major type of epigenetic mark in this organism, as well as other available functional genomics data to address this issue.
Duplicate genes, on average, share more common HM-code patterns than random singleton pairs in their promoters and open reading frames (ORFs). Though HM-code divergence between duplicates in both promoter and ORF regions increases with their sequence divergence, the HM-code in the ORF region evolves more slowly than that in the promoter region, probably owing to the functional constraints imposed on protein sequences. After excluding the confounding effect of sequence divergence (or evolutionary time), we found evidence supporting the notion that in yeast, the HM-code may co-evolve with cis- and trans-regulatory factors. Moreover, we observed that deletion of some yeast HM-related enzymes increases the expression divergence between duplicate genes, yet the effect is smaller than that of transcription factor (TF) deletion or environmental stresses.
Our analyses demonstrate that after gene duplication, the yeast histone modification profiles of duplicates diverged with evolutionary time, similar to genetic regulatory elements. Moreover, we found evidence of co-evolution between genetic and epigenetic elements since gene duplication, together contributing to the expression divergence between duplicate genes.
Histone modification; Histone modification code divergence; Gene duplication; Expression divergence; Epigenetic divergence; cis-regulation; trans-regulation
The organization and function of potential regulatory elements associated with the promoters of chicken H2A and H2B gene pairs have been examined. The intergene regions of six dispersed and divergently-transcribed H2A/H2B gene pairs contain several extremely well conserved and spaced blocks of sequence homology. Adjacent coding regions are on average 342 base-pairs apart. Respective TATA boxes are separated by 180 base-pairs and within this confined region there are four CCAAT boxes and a previously identified 13 base-pair H2B-specific element (H2B-box) which has homology to the octamer motif present in a number of gene promoter/enhancer elements. Transcription of H2A and H2B genes from wild-type and mutant constructs was measured in transient assays by transfection into HeLa cells, and in permanently transformed clonal cell lines. In vitro separation of the two genes at a unique intergenic site significantly decreased transcription of each gene. This suggested that the H2A/H2B gene pairs contained overlapping promoters. Deletion or point mutagenesis of the H2B-specific element decreased the levels of the H2B and H2A transcripts, indicating that this sequence is a common regulatory element of both genes in the divergent-pair configuration.
Ty transposable element insertion mutations of Saccharomyces cerevisiae can cause cell-type-dependent activation of adjacent gene expression. Several cis-acting regulatory regions within Ty1 that are responsible for these effects were identified. A 211-base-pair (bp) region functions as an activator. This region includes the so-called U5 domain of delta and 145 bp of adjacent epsilon sequences. Unlike activation by the intact Ty1, activation by the 211-bp Ty1 subfragment is cell-type independent. The presence of a 112-bp fragment from a more distal region of Ty1 confers cell-type specificity to the activator. The 112-bp fragment includes sequences with homology to mammalian enhancers and to a yeast a/alpha control site. In addition, Ty1 regions that exert negative effects on gene expression were identified. These results demonstrate that the Ty1 transcriptional control region consists of multiple components with distinct regulatory functions.
The prediction of operons in Mycobacterium tuberculosis (MTB) is a first step toward understanding the regulatory network of this pathogen. Here we apply a statistical model using logistic regression to predict operons in MTB. As predictors, our model incorporates intergenic distance and the correlation of gene expression calculated for adjacent gene pairs from over 474 microarray experiments with MTB RNA. We validate our findings with known examples from the literature and experimentation. From this model, we rank each potential operon pair by the strength of evidence for cotranscription, choose a classification threshold with a true positive rate of over 90% at a false positive rate of 9.1%, and use it to construct an operon map for the MTB genome.
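How a logistic model turns the two predictors into a co-transcription score can be sketched as follows. The coefficient values and the classification threshold below are hypothetical placeholders chosen only so the example runs; they are not the fitted values from this study, which would be estimated from known operon and non-operon pairs.

```python
import math


def logistic(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def operon_probability(
    distance_bp: float,
    expr_corr: float,
    b0: float = 0.5,       # hypothetical intercept
    b_dist: float = -0.01,  # hypothetical weight: larger gaps lower the score
    b_corr: float = 2.0,    # hypothetical weight: higher correlation raises it
) -> float:
    """Probability that two adjacent same-strand genes are co-transcribed."""
    return logistic(b0 + b_dist * distance_bp + b_corr * expr_corr)


def is_operon_pair(distance_bp: float, expr_corr: float,
                   threshold: float = 0.5) -> bool:
    """Apply a classification threshold (hypothetical value) to the score."""
    return operon_probability(distance_bp, expr_corr) >= threshold


if __name__ == "__main__":
    print(round(operon_probability(10, 0.9), 3))
    print(round(operon_probability(500, 0.1), 3))
```

Ranking all adjacent pairs by this probability and sweeping the threshold is what yields a trade-off between true positive and false positive rates.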
A ‘head-to-head’ (h2h) gene pair is defined as a genomic locus in which two adjacent genes are divergently transcribed from opposite strands of DNA. In our previous work, this gene organization was found to be ancient and conserved, which subjects functionally related genes to transcriptional co-regulation. However, some of the biological features of h2h pairs still need further clarification.
In this work, we sorted human h2h pairs into four sequentially inclusive sets of gradually increasing conservation, and examined whether the previously asserted features were conserved or sharpened in the more conserved h2h pair sets in order to identify the inherent features of the h2h gene organization. The features of TSS distance, expression correlation within h2h pairs and among h2h genes, transcription factor association and functional similarities of h2h genes were examined. Our conservation-based analyses found that the bi-directional promoters of h2h gene pairs are most likely shorter than 100 bp; h2h gene pairs generally have only significant positive expression correlation but not negative correlation, and remarkably high positive expression correlations exist among h2h genes, as well as between h2h pairs observed in our previous study; h2h paired genes tend to share transcription factors. In addition, expression correlation of h2h pairs is positively related to TF-sharing and functional coordination, but not related to TSS distance.
Our findings remove the uncertainties of h2h genes about TSS distance, expression correlation and functional coordination, which provide insights into the study on the molecular mechanisms and functional consequences of the transcriptional regulation based on this special gene organization.
There is increasing evidence that gene order within the eukaryotic genome is not random. In yeast and worm, adjacent or neighboring genes tend to be co-expressed. Clustering of co-expressed genes has been found in humans, worm and fruit flies. However, in mice and rats, an effect of chromosomal distance (CD) on co-expression has not been investigated yet. Also, no cross-species comparison has been made so far. We analyzed the effect of CD as well as normalized distance (ND) using expression data in six eukaryotic species: yeast, fruit fly, worm, rat, mouse and human.
We analyzed 24 sets of expression data from the six species. Highly co-expressed pairs were sorted into bins of equal sized intervals of CD, and a co-expression rate (CoER) in each bin was calculated. In all datasets, a higher CoER was obtained in a short CD range than a long distance range. These results show that across all studied species, there was a consistent effect of CD on co-expression. However, the results using the ND show more diversity. Intra- and inter-species comparisons of CoER reveal that there are significant differences in the co-expression rates of neighboring genes among the species. A pair-wise BLAST analysis finds 8–30% of the highly co-expressed pairs are duplicated genes.
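The binning step can be sketched as below. The abstract does not spell out the CoER formula, so this sketch assumes CoER is the number of highly co-expressed pairs in a distance bin divided by all pairs in that bin; the correlation threshold and bin width are likewise hypothetical parameters, and the usage data are toy values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def coexpression_rate_by_bin(
    pairs: Iterable[Tuple[float, float]],
    bin_width: float,
    threshold: float = 0.7,  # hypothetical cut-off for "highly co-expressed"
) -> Dict[int, float]:
    """Map each equal-width distance bin to its co-expression rate.

    `pairs` yields (chromosomal_distance, expression_correlation) tuples.
    Bin k covers distances in [k * bin_width, (k + 1) * bin_width).
    """
    total = defaultdict(int)
    high = defaultdict(int)
    for distance, corr in pairs:
        k = int(distance // bin_width)
        total[k] += 1
        if corr >= threshold:
            high[k] += 1
    return {k: high[k] / total[k] for k in sorted(total)}


if __name__ == "__main__":
    toy_pairs = [
        (100, 0.9), (200, 0.2),                            # bin 0
        (1200, 0.0), (1500, 0.8), (1700, 0.1), (1900, 0.3),  # bin 1
    ]
    print(coexpression_rate_by_bin(toy_pairs, bin_width=1000))
```

Comparing the rate in the first bins with the rate in distant bins is the kind of short-range versus long-range contrast the analysis relies on.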
We confirmed that in the six eukaryotic species, there was a consistent tendency for neighboring genes to be co-expressed. Results of pairwise BLAST indicate a significant effect of non-duplicated pairs on co-expression. A comparison of CD and ND suggests the dominant effect of CD.
co-expression; inter-species comparison; chromosomal distance; eukaryotic genome
In the mammalian genome, a substantial number of gene pairs (approximately 10%) are arranged head-to-head on opposite strands within 1,000 base pairs, and separated by a bidirectional promoter(s) that generally drives the co-expression of both genes and results in functional coupling. The significance of this unique genomic configuration remains elusive.
Here we report on the identification of an intergenic region of non-homologous genes, CDT2, a regulator of DNA replication, and an integrator complex subunit 7 (INTS7), an interactor of the largest subunit of RNA polymerase II. The CDT2-INTS7 intergenic region is 246 and 245 base pairs long in human and mouse respectively and is evolutionarily well conserved among several mammalian species. By measuring the luciferase activity in A549 cells, the intergenic human sequence was shown to be able to drive the reporter gene expression in either direction and notably, among transcription factors E2F, E2F1∼E2F4, but not E2F5 and E2F6, this sequence clearly up-regulated the reporter gene expression exclusively in the direction of the CDT2 gene. In contrast, B-Myb, c-Myb, and p53 down-regulated the reporter gene expression in the transcriptional direction of the INTS7 gene. Overexpression of E2F1 by adenoviral-mediated gene transfer resulted in an increased CDT2, but not INTS7, mRNA level. Real-time polymerase transcription (RT-PCR) analyses of the expression pattern for CDT2 and INTS7 mRNA in human adult and fetal tissues and cell lines revealed that transcription of these two genes is asymmetrically regulated. Moreover, the abundance of mRNA between mouse and rat tissues was similar, but these patterns were quite different from the results obtained from human tissues.
These findings add a unique example and provide mechanistic insights into the regulation of gene expression through an evolutionarily conserved intergenic region of the mammalian genome.
There has been much evidence recently for a link between transcriptional regulation and chromosomal gene order, but the relationship between genomic organization, regulation and gene function in higher eukaryotes remains to be precisely defined.
Here, we present evidence for organization of a large proportion of a human transcriptome into gene clusters throughout the genome, which are partly regulated by the same transcription factors, share biological functions and are characterized by non-housekeeping genes. This analysis was based on the cardiac transcriptome identified by our genome-wide array analysis of 55 human heart samples. We found 37% of these genes to be arranged mainly in adjacent pairs or triplets. A significant number of pairs of adjacent genes are putatively regulated by common transcription factors (p = 0.02). Furthermore, these gene pairs share a significant number of GO functional classification terms. We show that the human cardiac transcriptome is organized into many small clusters across the whole genome, rather than being concentrated in a few larger clusters.
Our findings suggest that genes expressed in concert are organized in a linear arrangement for coordinated regulation. Determining the relationship between gene arrangement, regulation and nuclear organization as well as gene function will have broad biological implications.
Ribosomal protein genes (RPGs) are essential, tightly regulated, and highly expressed during embryonic development and cell growth. Even though their protein sequences are strongly conserved, their mechanism of regulation is not conserved across yeast, Drosophila, and vertebrates. A recent investigation of genomic sequences conserved across both nematode species and associated with different gene groups indicated the existence of several elements in the upstream regions of C. elegans RPGs, providing a new insight regarding the regulation of these genes in C. elegans.
In this study, we performed an in-depth examination of C. elegans RPG regulation and found nine highly conserved motifs in the upstream regions of C. elegans RPGs using the motif discovery algorithm DME. Four motifs were partially similar to transcription factor binding sites from C. elegans, Drosophila, yeast, and human. One pair of these motifs was found to co-occur in the upstream regions of 250 transcripts including 22 RPGs. The distance between the two motifs displayed a complex frequency pattern that was related to their relative orientation.
We tested the impact of three of these motifs on the expression of rpl-2 using a series of reporter gene constructs and showed that all three motifs are necessary to maintain the high natural expression level of this gene. One of the motifs was similar to the binding site of an orthologue of POP-1, and we showed that RNAi knockdown of pop-1 impacts the expression of rpl-2. We further determined the transcription start site of rpl-2 by 5’ RACE and found that the motifs lie 40–90 bases upstream of the start site. We also found evidence that a noncoding RNA, contained within the outron of rpl-2, is co-transcribed with rpl-2 and cleaved during trans-splicing.
Our results indicate that C. elegans RPGs are regulated by a complex novel series of regulatory elements that is evolutionarily distinct from those of all other species examined up until now.
Correlation between the expression levels of genes which are located close to each other on the genome has been found in various organisms, including yeast, drosophila and humans. Since such a correlation could be explained by several biochemical, evolutionary, genetic and technological factors, there is a need for statistical models that correspond to specific biological models for the correlation structure.
We modelled the pairwise correlation between the expressions of the genes in a Drosophila microarray experiment as a normal mixture under Fisher's z-transform, and fitted the model to the correlations of expressions of adjacent as well as non-adjacent genes. We also analyzed simulated data for comparison. The model provided a good fit to the data. Further, correlation between the activities of two genes could, in most cases, be attributed to either of two factors: the two genes both being active in the same age group (adult or embryo), or the two genes being in proximity of each other on the chromosome. The interaction between these two factors was weak.
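Fisher's z-transform and a normal mixture density on the transformed scale can be written out as a short sketch. The two-component form and all parameter names here are assumptions for illustration; the abstract does not state how many components the fitted model uses or what their parameters are.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's z-transform, z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)).

    Defined for -1 < r < 1; math.atanh raises ValueError at r = +/-1.
    """
    return math.atanh(r)


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and std dev sigma."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(z: float, weight: float,
                mu0: float, sigma0: float,
                mu1: float, sigma1: float) -> float:
    """Two-component normal mixture density for a transformed correlation.

    `weight` is the (assumed) proportion of pairs in the correlated
    component (mu1, sigma1); the remainder follow (mu0, sigma0).
    """
    return (weight * normal_pdf(z, mu1, sigma1)
            + (1.0 - weight) * normal_pdf(z, mu0, sigma0))


if __name__ == "__main__":
    z = fisher_z(0.5)
    print(round(z, 4))
    print(round(mixture_pdf(z, 0.2, 0.0, 0.3, 0.6, 0.3), 4))
```

The transform is used because sample correlations are bounded and skewed, whereas their z-transformed values are approximately normal, which makes a normal mixture a natural model.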
Correlation between the activities of adjacent genes is higher than between non-adjacent genes. In the data we analyzed, this appeared, for the most part, to be a constant effect that applied to all pairs of adjacent genes.
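As a rough illustration of the modelling idea, the sketch below applies Fisher's z-transform, z = 0.5 ln((1 + r) / (1 - r)), to correlation coefficients and fits a normal mixture to the transformed values by expectation-maximisation. The abstract does not state the number of components or the fitting procedure, so the two-component choice, the EM details, and the function names are all assumptions made for this example.

```python
import math
import random

def fisher_z(r):
    """Fisher's z-transform of a correlation coefficient r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_component_mixture(z, n_iter=200):
    """Fit a two-component normal mixture by EM; returns (weights, means, sds).

    The two-component form is an assumption of this sketch, not the
    model specification from the study.
    """
    n = len(z)
    z_sorted = sorted(z)
    # Crude initialisation: lower and upper quartiles as starting means.
    means = [z_sorted[n // 4], z_sorted[(3 * n) // 4]]
    overall_mean = sum(z) / n
    overall_sd = math.sqrt(sum((v - overall_mean) ** 2 for v in z) / n)
    sds = [max(overall_sd, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in z:
            p = [weights[k] * normal_pdf(v, means[k], sds[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, z)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, z)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds
```

In use, one would compute the pairwise correlations for adjacent and for non-adjacent gene pairs, transform each set with `fisher_z`, and compare the fitted component means and weights between the two sets.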
Transcriptional regulation in eukaryotes often involves multiple transcription factors binding to the same transcription control region, and to understand the regulatory content of eukaryotic genomes it is necessary to consider the co-occurrence and spatial relationships of individual binding sites. The determination of conserved sequences (often known as phylogenetic footprinting) has identified individual transcription factor binding sites. We extend this concept of functional conservation to higher-order features of transcription control regions.
We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 989 potential regulatory 'templates': pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of word pairs to define sequence features yields more specific predictions of average expression profiles and more informative regression models for genome-wide expression data than considering sequence conservation alone.
The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that are specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions.
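The notion of a 'template', a pair of hexamers with a preferred relative spacing, can be made concrete with a short sketch that tallies the spacing between two words across a set of promoter sequences. This covers only the spacing tally: the conservation filter across the four Saccharomyces genomes and the statistical test for non-random spacing are not reproduced, and the function names and example hexamers are hypothetical.

```python
from collections import Counter

def find_positions(seq, word):
    """Return all 0-based start positions of word in seq, allowing overlaps."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def pair_spacings(seq, word_a, word_b):
    """Signed offsets (start of word_b minus start of word_a) for every co-occurrence."""
    starts_b = find_positions(seq, word_b)
    return [b - a for a in find_positions(seq, word_a) for b in starts_b]

def spacing_histogram(promoters, word_a, word_b):
    """Count how often each spacing occurs across a collection of promoter sequences."""
    counts = Counter()
    for seq in promoters:
        counts.update(pair_spacings(seq, word_a, word_b))
    return counts
```

A histogram that is sharply peaked at one or a few spacings, relative to what shuffled or unrelated sequences produce, would be the kind of signal that marks a word pair as a candidate template.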
The order of genes in eukaryotes is not entirely random. Studies of gene order conservation are important to understand genome evolution and to reveal why certain neighboring genes are more difficult to separate during evolution. Here, genome-wide gene order information was compiled for 64 species, representing a wide variety of eukaryotic phyla. This information is presented in a browser where gene order may be displayed and compared between species. Factors related to non-random gene order in eukaryotes were examined by considering pairs of neighboring genes. The evolutionary conservation of gene pairs was studied with respect to relative transcriptional direction, intergenic distance and functional relationship as inferred by gene ontology. The results show that, among conserved gene pairs, divergently and co-directionally transcribed genes are much more common than convergently transcribed genes. Furthermore, highly conserved pairs, in particular those of fungi, are characterized by a short intergenic distance. Finally, gene pairs of metazoa and fungi that are evolutionarily conserved and divergently transcribed are much more likely to be related by function than poorly conserved gene pairs. One example is the ribosomal protein gene pair L13/S16, which is unusual as it occurs both in fungi and alveolates. A specific functional relationship between these two proteins is also suggested by the fact that they are part of the same operon in both eubacteria and archaea. In conclusion, factors associated with non-random gene order in eukaryotes include relative gene orientation, intergenic distance and functional relationships. It seems likely that certain pairs of genes are conserved because the genes involved have a transcriptional and/or functional relationship. The results also indicate that studies of gene order conservation aid in identifying genes that are related in terms of transcriptional control.
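The classification of neighboring gene pairs by relative transcriptional direction and intergenic distance can be sketched as follows. This is a minimal illustration, not the pipeline behind the browser: it assumes 1-based inclusive coordinates on a single chromosome, and the `Gene` record and function names are hypothetical.

```python
from collections import namedtuple

# start < end in chromosome coordinates; strand is "+" or "-".
Gene = namedtuple("Gene", ["name", "start", "end", "strand"])

def classify_pair(left, right):
    """Classify two neighbors, where left lies before right on the chromosome."""
    if left.strand == "-" and right.strand == "+":
        return "divergent"       # 5' ends adjacent
    if left.strand == "+" and right.strand == "-":
        return "convergent"      # 3' ends adjacent
    return "co-directional"      # both genes on the same strand

def neighbor_pairs(genes):
    """Return (left name, right name, orientation, intergenic distance) per adjacent pair."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        # Bases strictly between the two genes; overlapping genes get 0.
        distance = max(0, right.start - left.end - 1)
        pairs.append((left.name, right.name, classify_pair(left, right), distance))
    return pairs
```

Applying `neighbor_pairs` per chromosome in each species and then asking which pairs recur in other species would give the raw counts needed to compare conservation across the three orientation classes.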
Humans share about 99% of their genomic DNA with chimpanzees and bonobos; thus, the differences between these species are unlikely to be in gene content but could be caused by inherited changes in regulatory systems. Endogenous retroviruses (ERVs) comprise ∼5% of the human genome. The LTRs of ERVs contain many regulatory sequences, such as promoters, enhancers, polyadenylation signals and factor-binding sites. Thus, they can influence the expression of nearby human genes. All known human-specific LTRs belong to the HERV-K (human ERV) family, the most active family in the human genome. It is likely that some of these ERVs could have integrated into regulatory regions of the human genome, and therefore could have had an impact on the expression of adjacent genes, which have consequently contributed to human evolution. This review discusses possible functional consequences of ERV integration in active coding regions.