The relationship between genetic sequence and transcriptional regulation is central to understanding species-specific biology, disease, and evolution (1). Identifying the divergence and conservation among functional regulatory elements is an important goal of comparative genomic research, and this is often done via DNA sequence comparisons using distant (2) and closely related species (3). Although both approaches have successfully identified conserved regulatory regions, the majority of transcription factor (TF) binding events change rapidly between closely related species, making them difficult to detect using DNA sequence alone (4). For instance, the experimentally determined binding events for homologous TFs found in mouse and human livers are unlikely to align with each other (7), despite conservation of their functional targets (8) and global liver transcription (9). The evolution of mammalian transcriptional regulation remains largely unexplored beyond limited mouse-human comparisons.
We therefore identified the genome-wide binding of two transcription factors: (i) CEBPA, in livers of species representing five vertebrate orders: human (primate), mouse (rodent), dog (carnivora), short-tailed opossum (didelphimorphia), and chicken (galliformes), and (ii) HNF4A, in livers from human, mouse, and dog. Chromatin immunoprecipitation experiments were combined with high-throughput sequencing (ChIP-seq) using healthy, nutritionally unstressed adult liver from the heterogametic sex as a functionally and transcriptionally conserved homologous tissue type (8) (Figure S1).
Figure 1. CEBPA binding in vivo in livers isolated from five vertebrate species cross-mapped to the human PCK1 gene locus. A rare ultraconserved binding event is shown surrounded by species-specific and partially shared binding events. On the left is the evolutionary ...
CEBPA and HNF4A were selected as representative transcription factors within the liver-specific regulatory network because both are conserved and constitutively expressed with well-characterized target genes (10). In addition, they represent distinct TF classes, and the DNA binding domains of each factor's orthologs are nearly identical among the study species (Figure S2).
The genomic TF occupancy data were reproducible between different individuals of the same species (Figure S3), and were validated using alternative antibodies (Figure S4). Using a mouse carrying a human chromosome, we confirmed that genetic sequence, and not diet, lifestyle, or environment, is the primary determinant of liver-specific TF binding (Figure S5). Given their greater evolutionary distance, contributions from non-genetic sources could be higher in opossum and chicken.
We identified TF-bound regions using a dynamic programming algorithm, and our results were robust to different peak-calling thresholds (Methods, Figure S6, Figure S7, Figure S8). To detect TF binding events shared among any combination of the five vertebrates, we used the Ensembl 12-way multi-species alignment (13), which incorporates approximately half of each species' genome into the global alignments. Our findings did not substantially change with an alternate methodology that used pairwise alignments performed using a separate algorithm (Methods, Figure S6, Figure S7, Figure S8).
Each transcription factor bound between 16,000 and 30,000 locations in each mammalian genome; CEBPA bound approximately half this number in the smaller chicken genome (Figure S6, Figure S7, Figure S9). For both factors, fewer than a quarter of bound regions were within three kilobases of known transcription start sites. Between 30% and 50% of the binding sites of the two transcription factors overlapped in the genome (Table S1). These overlapping sites did not differ substantially in the conservation of underlying genetic sequence from the sites of CEBPA and HNF4A considered individually.
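The promoter-proximity figure quoted above can be illustrated with a minimal sketch. This is not the analysis code used in the study: the function name `fraction_near_tss`, the input layout, and the use of a single coordinate per bound region (for example, a peak midpoint) are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def fraction_near_tss(
    peaks: List[Tuple[str, int]],   # (chromosome, position) per bound region; assumed layout
    tss: Dict[str, List[int]],      # chromosome -> TSS coordinates; assumed layout
    max_dist: int = 3000,           # three kilobases, as in the text
) -> float:
    """Return the fraction of bound regions within max_dist of any TSS."""
    sorted_tss = {chrom: sorted(sites) for chrom, sites in tss.items()}
    near = 0
    for chrom, pos in peaks:
        sites = sorted_tss.get(chrom, [])
        i = bisect_left(sites, pos)
        # Only the TSS immediately left and right of the peak can be the nearest one.
        neighbours = sites[max(i - 1, 0): i + 1]
        if any(abs(pos - site) <= max_dist for site in neighbours):
            near += 1
    return near / len(peaks) if peaks else 0.0


# Toy data, invented for illustration only.
toy_tss = {"chr1": [10_000, 50_000]}
toy_peaks = [("chr1", 12_000), ("chr1", 30_000), ("chr1", 47_500), ("chr2", 10_000)]
toy_fraction = fraction_near_tss(toy_peaks, toy_tss)
```

Sorting the start sites once per chromosome and using a binary search keeps the lookup cheap for tens of thousands of bound regions per genome.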
Figure 2. Conservation and divergence of transcription factor binding. (A) For CEBPA and (B) HNF4A, the pair-wise distribution and numbers of binding events are shown as a pie chart distributed into: intergenic (red), intronic (yellow), exonic (blue), and promoter ...
For these two liver-specific transcription factors, binding events appear to be shared 10%-22% of the time between mammals from any two of the three placental lineages we profiled, separated by approximately 80 million years of evolution (Figure S6, Figure S7). This reveals a rapid rate of evolution in transcriptional regulation among closely related vertebrates. Nevertheless, the number of CEBPA and HNF4A transcription factor binding events shared between any two of our five study species is far greater than could have occurred by chance (Figure S10).
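As a hedged illustration of how a pairwise sharing percentage of this kind could be tabulated, the sketch below assumes that each binding event has already been assigned the identifier of the alignment block it falls in. The function name `pairwise_sharing`, the block identifiers, and the choice of denominator (all events in the first species of each ordered pair) are assumptions, not details taken from the Methods.

```python
from itertools import permutations
from typing import Dict, Set, Tuple


def pairwise_sharing(bound: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """For each ordered pair (a, b), return the fraction of a's bound
    alignment blocks that are also bound in b."""
    sharing = {}
    for a, b in permutations(bound, 2):
        if bound[a]:
            sharing[(a, b)] = len(bound[a] & bound[b]) / len(bound[a])
        else:
            sharing[(a, b)] = 0.0
    return sharing


# Toy data, invented for illustration only.
toy_bound = {
    "human": {"b1", "b2", "b3", "b4"},
    "mouse": {"b2", "b5"},
    "dog": {"b2", "b3"},
}
toy_sharing = pairwise_sharing(toy_bound)
```

The measure is asymmetric by construction: the same shared events are a different fraction of each species' total, which is one reason a range rather than a single value is natural for pairwise comparisons.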
We used the genome-wide binding of CEBPA in opossum to test the hypothesis that regulatory regions have diverged substantially between eutherian and metatherian mammals (14). Opossum indeed showed dramatic changes in transcription factor binding: only 6-8% of the genomic regions occupied by CEBPA in opossum liver align with CEBPA binding events also found in mouse, dog, and/or human liver. This divergence was even greater in chicken, which shared only 2% of CEBPA binding with human, demonstrating extensive and continuous rewiring of gene regulation during vertebrate evolution that corresponds to evolutionary distance.
Ultra-conserved noncoding regions are an intriguing discovery revealed by comparative genomic sequencing (15). We identified ultra-shared interactions between CEBPA and the vertebrate genome as binding events preserved over the 300 million years of evolution and thus found in aligned positions in all five species: human, mouse, dog, opossum, and chicken. Using our most stringent threshold, a set of 35 binding events was found to be shared by all five vertebrate species, and these binding events are almost invariably near genes central to liver-specific biology (Table S2, Table S3; see also below). Although these ultra-shared binding events are close to important liver-specific genes, they make up less than 0.3% of the total CEBPA binding found in human.
About 250 direct functional HNF4A target genes have recently been identified using multiple independent methodologies in mouse and human, including perturbation analysis in both species (8). We experimentally identified a similar set of transcriptional target genes whose expression is dependent on CEBPA in adult mouse liver by using a conditional knock-out strategy (11). In mammals, the target genes for both transcription factors have a disproportionate fraction of binding events that are shared in at least two species (p-value < 1 × 10^-5) (Table S4). CEBPA binding near direct target genes did not overlap with the binding events shared by five species.
We further compared our results to a set of 53 regulatory sequences within known, authentic liver enhancers in human (Table S5). Thirty-eight of these regulatory sequences were located within nine HNF4A-bound regions. CEBPA binding overlapped with five of these HNF4A-bound regions, and we also found that five of the nine HNF4A binding events were bound by HNF4A in more than one species. Overall, these findings suggest that functional targets are enriched for TF binding events found in multiple species.
Mammalian TF binding studies have suggested that functional enhancers show increased sequence constraint (17). As expected, the relatively few binding events shared among three or five species showed increased sequence constraint. The sequence constraint, evaluated using Genomic Evolutionary Rate Profiling (GERP) scores (19), in bound regions near functional targets was similar to that for all bound regions for both TFs, and these results were robust to the method applied. Regions bound by both CEBPA and HNF4A have sequence constraint patterns similar to those found for each factor analyzed independently (Figure S11). In sum, TF binding events near functional targets showed enhanced sharing between species, without a corresponding increase in sequence constraint.
DNA binding specificities of transcription factors show remarkable diversity and complexity (18), yet few studies have compared specificities of orthologous transcription factors among multiple species. The motifs we directly determined from experimental binding data showed that in vivo bound consensus sequences remain virtually unchanged during vertebrate evolution despite most binding events being species-specific (Figure S12). Neither the quality of a bound motif, as determined by its similarity to the consensus, nor the regional ChIP enrichment, as measured by sequencing read depth, was correlated with the conservation of TF binding events (Figure S13).
Figure 3. DNA binding specificities of CEBPA and HNF4A are highly conserved during vertebrate evolution. (A) The known sequence motifs were identified de novo in each species interrogated (Methods), and found within almost all binding events (see Figure S12).
Searching for the sequence features that are associated with shared binding events, we discovered that binding events shared by more species contain more aligned motifs. These shared regions represent examples of deeply conserved regulatory architecture featuring multiple motifs at specific sequence locations maintained through vertebrate evolution. The most conserved of these, the five-way ultra-shared sites, also exhibit the strongest sequence constraint.
Figure 4. Lineage-specific loss and turnover of transcription factor binding events. (A) The unbound regions in each placental mammal that align to regions showing TF binding in the other two placental mammals were collected, and the mechanisms by which the underlying ...
To explore the genetic mechanisms underlying the divergence of transcription factor binding, we identified potentially lost CEBPA and HNF4A binding events. A binding event was assumed to be lost if it was not present in one placental mammal, yet was experimentally found at aligned, orthologous regions in the other two placental mammals. Using parsimony, this situation is best explained by an ancestral TF binding event present before the mammalian radiation that was subsequently lost along one lineage.
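The parsimony rule described above can be expressed as a short sketch. It is an illustration under stated assumptions rather than the code used for the analysis: the function name `lineage_specific_losses`, the region identifiers, and the boolean occupancy table are hypothetical.

```python
from typing import Dict, List

# The three placental mammals profiled for both factors.
PLACENTALS = ("human", "mouse", "dog")


def lineage_specific_losses(
    occupancy: Dict[str, Dict[str, bool]],  # aligned region id -> species -> bound?
) -> Dict[str, List[str]]:
    """Assign each aligned region bound in exactly two placentals to the
    third species as an inferred lineage-specific loss."""
    losses: Dict[str, List[str]] = {species: [] for species in PLACENTALS}
    for region_id, bound in occupancy.items():
        unbound = [sp for sp in PLACENTALS if not bound.get(sp, False)]
        if len(unbound) == 1:
            # Bound in the other two species: parsimony favours one loss
            # over two independent gains.
            losses[unbound[0]].append(region_id)
    return losses


# Toy data, invented for illustration only.
toy_occupancy = {
    "r1": {"human": True, "mouse": True, "dog": False},   # loss in dog
    "r2": {"human": True, "mouse": True, "dog": True},    # shared by all three
    "r3": {"human": False, "mouse": True, "dog": False},  # mouse-specific, not a loss
    "r4": {"human": False, "mouse": True, "dog": True},   # loss in human
}
toy_losses = lineage_specific_losses(toy_occupancy)
```

Regions bound in only one species are deliberately left unclassified here, since with three species they are equally well explained by a single gain.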
The lost binding events were categorized by the sequence changes to the alignable binding motifs within the orthologous regions of the other species. Between 20% and 40% of the motifs associated with lineage-specific binding event losses were unchanged. These regions may represent cases of epigenetic redirection, yet-to-be characterized SNPs or indels, or loss of nearby genomic binding partners. A larger fraction of the absent binding events were associated with motifs whose disruption could be assigned to base pair substitutions, indels, and gaps in the alignment. Across all the vertebrate species, indels appear to be associated with loss of the underlying sequence motif a third as often as mismatches. A four-mammal analysis using opossum as an outgroup afforded similar results (Figure S14). Analogous mechanisms appear to explain species-specific gains of transcription factor binding events (Figure S15). Taken together, the steady accumulation of small changes in the genetic sequence appears to rapidly remodel thousands of transcription factor binding sites.
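One way to picture the categories above is a small classifier over aligned motif strings. The sketch below is an assumption-laden illustration, not the categorization procedure from the Methods: the function name `classify_motif_change`, the use of `-` as the gap character, and the rule that an entirely unaligned motif counts as an alignment gap while a partly gapped one counts as an indel are all choices made here for clarity.

```python
from typing import Optional


def classify_motif_change(bound_motif: str, lost_motif: Optional[str]) -> str:
    """Compare the aligned motif from a species with binding against the
    aligned sequence from the species lacking binding."""
    # No aligned sequence at all, or only gap characters: alignment gap.
    if lost_motif is None or set(lost_motif) <= {"-"}:
        return "alignment gap"
    if len(bound_motif) != len(lost_motif):
        raise ValueError("aligned sequences must have equal length")
    # Gap characters on either side of the alignment indicate an insertion or deletion.
    if "-" in bound_motif or "-" in lost_motif:
        return "indel"
    if bound_motif.upper() != lost_motif.upper():
        return "substitution"
    return "unchanged"


# Toy sequences, invented for illustration only.
toy_calls = [
    classify_motif_change("TTGCGCAA", "TTGCGCAA"),
    classify_motif_change("TTGCGCAA", "TTGCACAA"),
    classify_motif_change("TTGCGCAA", "TTG--CAA"),
    classify_motif_change("TTGCGCAA", "--------"),
]
```

A motif that has both a gap and a mismatch is reported as an indel under this ordering; a different precedence would shift counts between the two categories.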
Approximately half of lineage-specific losses of TF binding showed evidence of nearby compensatory binding events. A quarter of species-specific losses had a nearby (±10 kb) gained binding event unique to the same lineage (unshared turnover), and an additional quarter of the losses had a nearby binding event that is shared in one or more other species (shared turnover) (Figure S16). The latter case suggests the existence of a cluster of binding events in the common ancestor. In both cases, the probability of finding a turnover decreased rapidly with distance from the loss (Figure S16), but a shared turnover was typically closer to the site of the loss than was an unshared turnover (p-value < 1.0e-10 for CEBPA and p-value < 1e-15 for HNF4A).
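The turnover categories above can be sketched as a window search around each lost site. This is a minimal illustration, not the study's implementation: the `BindingEvent` class, the function name `classify_turnover`, and in particular the tie-breaking rule (the nearest event within the window decides the category when both kinds are present) are assumptions.

```python
from dataclasses import dataclass
from typing import List

WINDOW_BP = 10_000  # the +/-10 kb window quoted in the text


@dataclass(frozen=True)
class BindingEvent:
    chrom: str
    position: int  # coordinate in the species that lost binding (assumed peak midpoint)
    shared: bool   # True if the event is also bound at the aligned site in another species


def classify_turnover(
    loss_chrom: str,
    loss_pos: int,
    events: List[BindingEvent],
    window: int = WINDOW_BP,
) -> str:
    """Label a lost site by the nearest binding event within the window."""
    nearby = [
        e for e in events
        if e.chrom == loss_chrom and abs(e.position - loss_pos) <= window
    ]
    if not nearby:
        return "no turnover"
    # Assumption: the closest event determines the category.
    nearest = min(nearby, key=lambda e: abs(e.position - loss_pos))
    return "shared turnover" if nearest.shared else "unshared turnover"


# Toy data, invented for illustration only.
toy_events = [
    BindingEvent("chr1", 105_000, shared=False),
    BindingEvent("chr1", 98_000, shared=True),
    BindingEvent("chr2", 100_500, shared=True),
]
toy_label = classify_turnover("chr1", 100_000, toy_events)
```

Recording the distance to the nearest event alongside the label would also allow the distance distributions of shared and unshared turnovers to be compared.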
Understanding the evolutionary dynamics of transcription factor binding is essential to understanding the evolution of gene regulation. Many comparative genomics approaches assume that a multi-species alignment of a high quality motif is indicative of functionality (19). Our analysis of experimentally determined in vivo occupancy of two TFs in multiple vertebrates revealed apparent limitations to this model and a number of other insights about the complex relationship between genetic sequence, transcription factor binding, and genome regulation.
First, the vast majority of ChIP-identified transcription factor binding events are unique to each species; in mammals, the binding events that occur within species-specific, repetitive DNA are more common than conserved binding events. Second, ultra-shared TF binding events, which are the functional counterpart of ultraconserved sequences, appear rarely in vivo among all five vertebrates. Third, only approximately half of binding events that are lost in one placental mammal yet present in the other two are potentially recovered by nearby turnover events. Fourth, neither motif quality nor strength of TF binding correlates with conservation of a transcription factor's genomic occupancy. Alterations in the DNA binding specificity of CEBPA and HNF4A cannot account for rapid binding divergence, nor can species-specific environmental differences.
Nevertheless, comparing binding events within 10 kb of the transcription start site (TSS) of experimentally determined target genes of CEBPA and HNF4A has shown that binding events near these genes are more likely to be shared with other species, although this does not correspond to an increase in sequence constraint. In fact, the set of the ultra-shared, five-way binding events is entirely disjoint from the set of genes directly dependent on CEBPA in adult liver. For HNF4A, only 6% of binding events shared across three placental mammals are near the highest-quality functional target genes, namely, those genes that depend on HNF4A for proper expression in both mouse and human. Given that most TFs are active in multiple cell types (26), it is possible that the remaining shared sites are active in other tissues or other developmental stages. Indeed, the ultra-shared CEBPA binding events are uniformly found near liver-specific genes that would be expected to be upregulated upon liver organogenesis. Conversely, those binding events near functional targets in adult liver that are neither shared nor show signs of sequence constraint may represent lineage-specific regulatory interactions.
The preponderance of species-specific binding and the rapid lineage-specific loss of binding events suggest that a sizeable majority of specific TF-DNA interactions could be evolving neutrally. Because liver-specific TFs and subsequent gene expression are both highly conserved, the rapid gain and loss of binding events may be indicative of compensatory changes that maintain local concentrations of TF binding near functional targets (27). Indeed, a recent computational approach that uses a high concentration of TF binding motifs, regardless of their alignment, showed improved ability to predict regulatory interactions (28).
Despite the rapid gain and loss of TF binding events in mammals, tissue-specific gene regulation seems to be maintained by identifiable regulatory architectures that can be independent of sequence constraint.