Repetitive DNA and nucleosome binding affinity
Experimentally characterized human gene proximal promoter sequences (n
= 7,913) were taken from the Database of Transcriptional Start Sites (DBTSS) (Suzuki et al., 2002
) and analyzed with respect to their repetitive DNA content and nucleosome binding affinities. The locations of repetitive DNA elements along promoter sequences were determined by the RepeatMasker program and nucleosome binding affinities were predicted using the method of Segal et al. (Segal et al., 2006
). Two classes of repetitive DNA were analyzed separately: interspersed repeats, also known as transposable elements (TEs) and simple sequence repeats (SSRs), which are made up of runs of exact or nearly exact repeating k
-mers. For each promoter position, from 1kb upstream to the transcriptional start site (TSS), the average TE and SSR densities over all promoter sequences were calculated as the fraction of sequences for which that position was occupied by a TE or SSR. Average nucleosome binding affinities across promoter positions were calculated as the fraction of sequences for which a given position was predicted to be occupied (bound) by a nucleosome. Average nucleosome binding affinities and the average TE density follow parallel trends along the proximal promoter regions (). Nucleosomes bind more tightly and TEs are found more frequently distal to the TSS, whereas nucleosomes bind promoter sequences most proximal to the TSS with lower affinity and TEs are rarely found close to the TSS. SSRs show a distinctly different trend with a higher density close to the TSS that corresponds to the decrease in nucleosome binding affinity. The SSR density matches the nucleosome binding even more closely than the TE density just upstream of the TSS. Nucleosome binding affinities decrease steadily from distal regions until ~35bp upstream of the TSS, then the nucleosome binding affinity increases towards the TSS. Similarly, the SSR density increases to the same point and then drops off as the nucleosome binding affinity increases (). This core promoter region where nucleosome binding affinity is at its lowest and SSR density is at its highest corresponds to the location where the basal transcriptional machinery assembles, and RNA polymerase II binds, to initiate transcription.
Repetitive DNA density and nucleosome binding affinity along human proximal promoter sequences
The correlations between nucleosome binding affinities with TE and SSR densities along human proximal promoter regions are robust and highly statistically significant (). Previously, we observed that nucleotide composition changes markedly along human proximal promoter sequences with an increase in CpG frequency close to the TSS (Marino-Ramirez et al., 2004
), while the nucleosome binding prediction method we employed in this analysis relies on the periodicity of AT-rich dinucleotides (Segal et al., 2006
). Thus, it is possible that the high (low) nucleosome binding affinity of TE (SSR) sequences in proximal promoter regions is a corollary effect of local differences in nucleotide composition. We attempted to control for this possibility in several ways. First of all, average nucleosome binding affinities were computed for all TE, SSR and non-repetitive sequences irrespective of their locations along proximal promoter regions. On average, TE sequences bind nucleosomes most tightly, followed by non-repetitive DNA and SSRs, which have the lowest nucleosome affinities (); all differences are highly statistically significant (ANOVA F
= 4.5e11, P
Nucleosome binding properties for repetitive versus non-repetitive DNA
In addition to the binding affinity observations that are based on the nucleosome prediction software, we also analyzed the nucleosome wrapping characteristic AA/TT/TA dinucleotide frequencies along experimentally characterized nucleosome bound sequences from chicken (Satchwell et al., 1986
) that we identified as being derived from TEs (n
= 39). The chicken TE sequences show the characteristic AA/TT/TA dinucleotide periodicity expected of nucleosome bound sequences (); in fact, the average distance between dinucleotide peaks for these TE sequences is ~10.3bp, which is close to the expected distance of 10.2bp corresponding to one turn of the DNA helix (). This is significant because DNA sequences are thought to wrap around nucleosomes by bending sharply at each repeating turn of the DNA helix, and this sharp bending is facilitated by the specific AA/TT/TA dinucleotides (Widom, 2001
We also attempted to control for nucleotide composition effects by randomizing promoter sequences and re-calculating nucleosome binding affinities. First, entire 1kb promoter sequences were randomized and nucleosome binding affinities re-calculated. This control procedure has the effect of eliminating both native dinucleotide occurrences and local nucleotide composition biases. The average nucleotide binding affinity for such randomized promoter sequences (nba = 0.16) is ~3x lower than seen for the observed promoter sequences (nba = 0.49), and the difference between random and observed affinities is highly significant (t = 23, P = 5.3e-100). In addition to differences in the magnitude of the nucleosome binding affinities, the relative affinity trends along the promoters were compared for the random versus observed sets. Partial correlation was used to control for the effects of the random sequences on the observed relationship between nucleosome binding affinity with TE and SSR densities along proximal promoters. The positive (negative) correlations between nucleosome binding for TE (SSR) do not change when the correlations between random sequences and nucleosome binding along the promoters are accounted for [rnba·TE|random1 = 0.99 and rnba·SSR|random1 = -0.85].
A second randomization procedure was done to account for local differences in nucleotide composition along proximal promoter sequences. In this case, sequences were randomized within non-overlapping 100bp windows along the promoters. This had the effect of eliminating native dinucleotide occurrences while maintaining local nucleotide composition biases. As with the complete sequence randomization procedure, the locally randomized sequences have significantly lower nucleosome binding affinities (nba = 0.23) than the observed sequences (nba = 0.49), and this 2.1x difference is highly statistically significant (t = 17, P = 5.0e-55). Clearly, local nucleotide composition alone can not explain the observed nucleosome binding affinities. However, the relative trends in nucleosome binding show different local nucleotide composition effects for TEs versus SSRs. The partial correlation controlling for the effects of local nucleotide composition on the relationship between TE density and nucleosome binding eliminates the positive correlation seen across the entire promoter for the observed data [rnba·TE|random2 = -0.14]. This suggests that local nucleotide composition bias influences the decreasing trend in nucleosome binding affinities along proximal promoters irrespective of TE density. Interestingly, this same mitigating effect of local nucleotide composition is not seen for the relationship between SSRs and nucleosome binding [rnba·SSR|random2 = -0.53]. This suggested the possibility that most of the local nucleotide composition bias effect on the relationship between TEs and nucleosome binding may be confined to the region closest to the TSS where TEs are largely absent and SSRs are at their most dense (). Indeed, when partial correlation controlling for local nucleotide bias is done excluding 150bp upstream of the TSS, the positive correlation between TEs and nucleosome binding affinity remains [-1000 to -150 rnba·TE|random2 = 0.76]. In other words, positive TE effects on nucleosome binding are most evident away from the TSS, while the SSRs that inhibit nucleosome binding act closest to the TSS.
Taken together, these data suggest the intriguing possibility that the human genome utilizes repetitive DNA content along promoter regions to tune nucleosome binding in such a way as to facilitate maximum access of the basal transcriptional machinery just upstream of TSS. Furthermore, different classes of repeats play distinct roles in this process; TEs bind nucleosomes tightly yielding compact less accessible DNA, while SSRs extrude nucleosomes creating a relatively open chromatin environment.
In addition to the human genome analysis, the relationship between nucleosome binding and repetitive DNA content of proximal promoter regions was evaluated for four additional mammalian species with complete genome sequences available: chimpanzee (Pan troglodytes
), mouse (Mus musculus
) and rat (Rattus norvegicus
). For these species, NCBI Refseq gene models were used to define TSS and proximal promoter regions, while TE and SSR repeats and nucleosome binding were analyzed as was done for the human genome. The trends observed for human are highly similar to those seen for the other mammalian species (Supplementary Figure 1
). In chimpanzee, mouse and rat, nucleosome binding affinities decrease steadily along the proximal promoter region until the core promoter, <50bp from the TSS, where nucleosome binding begins to increase. For these three species, TE density drops precipitously and steadily along the proximal promoter while SSR density increases sharply at first in the core promoter near the TSS and then drops off again as nucleosome binding affinity increases. Thus, repeat- rich mammalian genomes appear to use repetitive DNA elements to tune nucleosome binding and core promoter accessibility in similar ways. The conservation of the relationship between repetitive DNA and nucleosome biding in core promoters of several mammalian species suggests that this mechanism may have evolved early in the mammalian radiation as repetitive elements were proliferating within genomes. However, many of the repetitive elements that yield these patterns evolve rapidly and are lineage-specific. Accordingly, there may be an ongoing dynamic between repeat generation by mutation and/or transposition followed by selection based on the promoter location of the repeat and specific requirements for chromatin accessibility. For TEs in particular, this could simply mean that the elements are eliminated from core promoter regions close to the TSS by purifying selection. Indeed, negative selection against TE insertions closest to TSS would seem to be the easiest way to explain the observed pattern of TE density ( and Supplementary Figure 1
). However, our analysis of gene expression data, described in following sections, suggests that this is not the case. SSRs, on the other hand, appear to be favored in core promoter regions.
TE-specific effects on nucleosome binding affinity
The Repbase library of repetitive DNA elements used by the program RepeatMasker can be used to annotate TEs into different classes and families (Jurka et al., 2005
; Kapitonov and Jurka, 2008
). Using this approach, human TE sequences were divided into LINEs, (L1 and other LINES), SINEs (Alu and other SINEs), LTR retrotransposons, and DNA transposons to determine if different classes (families) of elements show differential nucleosome binding affinities (). In general, LINEs, LTR retrotransposons and DNA transposons have higher affinities for nucleosomes compared to SINEs. Specifically, L1 elements exhibit the highest nucleosome binding affinities while Alu elements display the lowest affinity for nucleosomes. All differences are statistically significant (, ANOVA).
Average* nucleosome binding affinities for TE classes (families)
The differences in nucleosome binding affinities between L1 and Alu are consistent with their respective nucleotide compositions and perhaps also relevant to their genomic distributions. L1 elements, and LINEs in general, are more AT-rich than Alus (SINEs), and AT-rich sequences are more likely to bind nucleosomes tightly as discussed previously. L1 elements are also biased towards intergenic regions in their distribution, while Alu elements are found primarily in gene rich regions. In fact, it has been shown that Alus are preferentially retained in GC- and gene-rich regions of the genome, and this has been taken to suggest that they may be selectively fixed therein by virtue of some gene-related function that they play (Lander et al., 2001
). Our data showing lower nucleosome binding for Alu elements suggests that they may be retained in gene regions by virtue of their ability to maintain a relatively open chromatin environment. Conversely, L1 elements may help to maintain compact chromatin structures characteristic of intergenic regions.
Promoter repeat architecture and gene expression levels
In light of the observed relationship between repetitive DNA elements and nucleosome binding, we used the repetitive DNA content of proximal promoter regions to group human genes into related clusters. The gene expression and functional properties of the clusters were then compared to their characteristic repeat architectures. To cluster human genes using their promoter repeat distributions, proximal promoter sequences were represented as 1,000-unit vectors with each position in a sequence-specific vector receiving a score indicating whether that particular nucleotide is part of a TE, SSR or non-repetitive sequence. These gene-specific promoter repeat vectors were then compared using a distance metric and clustered as described (Materials and Methods). This approach ensured that the clusters reflect both the abundance, or lack thereof, and the location of distinct repetitive DNA elements in human promoter sequences. In other words, this scheme relates human genes solely by virtue of their promoter repeat distributions.
We obtained six repeat-specific clusters of human genes in this way (), each cluster representing a distinct overall pattern of TE and/or SSR content and distribution. Two of these clusters (c1 & c2, TE−) consist of genes that are largely devoid of TEs, while four consist of genes with increasing TE densities (c3 — c6, TE+). c1 does not contain any repetitive DNA, while c2 is enriched in SSR sequences and has very low TE content. c3 — c6 have progressively more TE content with locations shifting slightly towards the TSS.
Clusters of human proximal promoters based on their repetitive DNA sequence distributions
The gene expression properties of the human genes in these clusters were analyzed using version 2 of the Novartis mammalian gene expression atlas (GNF2) (Su et al., 2004
). This data set consists of Affymetrix microarray experiments, performed in replicate, on 79 different human tissue (cell) samples. For each human gene, over 79 tissues, we computed the average expression level, maximum expression level and breadth of expression as described (Materials and Methods); cluster-specific averages for each of these parameters were then compared (). We were surprised to find that clusters that contain TEs (c3 — c6, TE+) have higher average, maximum and breadth of expression than clusters that are largely devoid of TEs (c1 & c2, TE−). Gene expression levels are known to correlate with a number of measures of gene ‘importance’ such as sequence and phylogenetic conservation, fitness effects, numbers of protein interactions etc (Duret and Mouchiroud, 2000
; Pal et al., 2001
; Krylov et al., 2003
; Zhang and Li, 2004
; Wolf et al., 2006
). In other words, genes that are more highly and broadly expressed are under greater purifying selection than genes with lower expression levels. If TEs are eliminated from proximal promoter sequences by purifying selection, then one may expect that TE+ promoters would have lower, and not higher as we observe, levels of gene expression than TE− promoters. In other words, our analysis of repeat cluster gene expression levels argues against the straightforward interpretation that the paucity of TEs in proximal promoter sequences, and their decreasing frequency closer to TSS, is a result of purifying selection against disruptive insertions in core promoters.
Gene expression comparison for TE− versus TE+ promoter clusters
On the other hand, one may expect that genes with more restricted and more tightly regulated expression, such as developmental genes, would have more TE sensitive promoters than genes that are highly and broadly expressed. In fact, developmental genes are known to have promoters that are largely devoid of TEs (Simons et al., 2006
; Simons et al., 2007
). This may reflect that fact that such genes are more finely and tightly regulated and accordingly contain more complex promoters with higher numbers of cis-regulatory elements. If this is indeed the case, then the paucity of TEs in proximal promoter regions may still be explained, to some extent, by purifying selection against disruptive insertions. Discrimination between these two hypotheses regarding the selective elimination, or lack thereof, of proximal promoter TE sequences awaits further analysis.
Promoter repeat architecture and tissue-specific gene co-expression
In addition to analyzing repeat cluster gene expression levels, we also evaluated the relationship between the tissue-specific expression patterns of genes across the 79 tissues from GNF2 and their promoter repeat content. To do this, gene-specific vectors of expression levels across tissues were compared using the Pearson correlation coefficient (r); positive values of r indicate gene pairs that are co-expressed across tissues. For each cluster, average r-values were computed based on all pairwise comparisons within the cluster (). Higher average r-values are associated with increasing TE promoter content of the clusters. For instance, there is a positive (R = 0.77), albeit marginally significant (z = 1.72, P = 0.1), rank correlation between cluster TE content and co-expression. In addition, all four TE+ clusters have greater average co-expression than either of the TE− clusters, and the average r-value for TE+ clusters together is significantly greater than seen for the combined TE− clusters ().
Gene co-expression for repeat-specific proximal promoter clusters
The possibility of gene co-regulation within repeat clusters was also evaluated by taking the difference between the average r-value for all pairwise comparisons within clusters to average pairwise r-value for all gene comparisons (Materials and Methods). If genes within clusters are co-regulated, then the value of this difference should be positive, whereas no co-regulation will yield a negative difference value. The TE− clusters 1 & 2 have negative difference values indicating that genes with no TEs in their promoters are less co-expressed with other genes possessing a similar lack of repeats than they are with all genes. On the other hand, the TE+ clusters 3-6 all have positive difference values further demonstrating that genes with similar repetitive DNA profiles in their promoters are more closely co-expressed than are random pairs of genes. The difference values for each cluster are statistically significant (7.3>z>100.6, 1.4e-13<P<0).
Taken together, these observations on gene co-expression also argue against the notion that TE insertions in proximal promoter sequences are basically disruptive or deleterious, since the presence of similar TE promoter distributions implies a higher level of gene co-regulation than the absence of TEs does. This is not to say that the majority of de novo TE insertions in and around functional promoter sequences are not deleterious, clearly they are. However, the repeat sequences that have been fixed in proximal promoter sequences do appear to make functionally relevant contributions to chromatin accessibility and help to regulate levels and specific patterns of gene expression.
Probabilistic analysis of promoters and gene expression
Given the relationship between gene expression and the repetitive DNA architecture of human promoters we observed, we wanted to further evaluate the propensity of human genes to be expressed in specific tissues based on the repetitive DNA content of their promoters. To do this, we used a probabilistic representation of cluster-specific promoter architectures together with the GNF2 expression data. This involved partitioning 1kb proximal promoter sequences into 20 non-overlapping windows of 50bp each, and for a given cluster, representing the probability of observing TE, SSR or non-repetitive nucleotides in each window (Materials and Methods). The probabilistic representation of promoter repeat architectures we employed is mathematically analogous to the probabilistic representations of position weight matrices (PWMs) used to summarize position-specific residue frequencies among collections of sequence motifs such as transcription factor biding sites (Wasserman and Sandelin, 2004
). Accordingly, promoter repeat profiles can be represented as sequence logos showing the probability and distribution for sites of different repeat classes (Supplementary Figure 2
). The cluster-specific promoter repeat profiles can then be used to score individual promoter sequences just as PWM representations can be used to score putative motif sequences. Connecting these cluster- and position-specific promoter repeat profiles to tissue-specific gene expression profiles was done in a way that is similar to the methodology used to connect the presence of transcription factor binding site motifs to specific gene expression patterns (Conlon et al., 2003
For each of the 79 tissues in GNF2, each promoter sequence was given six cluster-specific scores, and for each cluster, the gene-specific scores were correlated with the tissue-specific gene expression levels (Materials and Methods). This resulted in a 6-by-79 matrix of cluster-by-tissue correlations (). The TE+ clusters 4 & 6 show particularly high correlations with a number of tissues, such as B lymphoblasts (), whereas the TE− clusters 1 & 2 show low correlations with the same tissues and lower correlations overall. This indicates that certain repeat-rich promoter architectures play a role in driving tissue-specific expression, while repeat poor promoters have less coherent regulatory properties. In addition, the differences in promoter score-expression level correlations across tissues and between clusters indicate that different repeat contexts are likely to have tissue-specific regulatory functions. Hierarchical clustering of the tissues and the clusters, according to the promoter score-expression level correlations, groups related tissues together including reproductive tissues, immune related cells and cancer samples (). This indicates that TE-rich promoters may help to regulate genes that function specifically in these tissues further underscoring the biological significance of promoter sequence repetitive DNA profiles.
Promoter repetitive DNA architecture and tissue-specific gene expression
Gene Ontology analysis
Having established a connection between repetitive DNA promoter architectures and gene regulation, we wondered whether genes with similar promoter repeat distributions encoded proteins with related functions. In order to test this, we used analysis of Gene Ontology (GO) terms for genes within and between the TE− versus the TE+ repeat-specific promoter clusters (). A modified version of the GO semantic similarity measure (Lord et al., 2003
; Azuaje et al., 2005
) was used to compare the similarities between GO terms within clusters versus the background GO similarity among all pairs of genes. As described previously (Marino-Ramirez et al., 2006
; Tsaparas et al., 2006
), the GO semantic similarity approach measures the pairwise similarity between annotation terms along the GO directed acyclic graph in order to evaluate the functional similarity between pairs of genes. For TE− and TE+ genes, the GO similarity difference (GOdiff
) is equal to the average GO similarity for all gene pairs within clusters minus the average GO similarity for all possible gene pairs (Materials and Methods). Negative values of GOdiff
indicate that gene pairs are more similar within clusters than for all possible pairs. Both the TE− and TE+ gene sets encode proteins that are significantly more functionally similar than the background comparison set [TE− = -3.4e-3, z
= 34, P
0; TE+ = -7.9e-3, z
= 11, P
= 4.8e-3]. However, within the TE+ clusters, pairs of genes encode proteins that are significantly more functionally similar, on average, than the pairs of genes found within the TE− clusters (t
= 5.8, P
= 6.4e-9). This is consistent with the stronger signal of gene co-regulation seen for clusters of promoter sequences that are enriched for TEs and underscores the potential biological significance of repeat-rich promoter sequences in the human genome.
Given the functional coherence of repeat-specific clusters demonstrated by the GO similarity analysis, we wanted to evaluate whether certain GO functional categories are over-represented within specific clusters. To do this, we traced the GO terms represented in the dataset to GO slim terms (). GO slim categories provide a higher level view of more granular individual GO annotations in order to provide an overview of the kinds of functions that may be over-represented in different groups. The observed counts of GO slim categories for each of the six repeat-specific clusters, as well as for the combined TE− and TE+, groups were compared to their expected values based on the background GO slim frequencies across all clusters to look for over-represented terms. Genes in the electron transport, cytoplasm, catalytic activity and oxidoreductase activity categories were found to be over-represented in TE+ clusters and accordingly under-represented in the TE− clusters, whereas genes in cell communication, multicellular organismal development, regulation of biological process and transcription regulator activity categories are over-represented in TE− clusters and under represented in TE+ clusters. Evaluation of over-represented GO terms in individual clusters reveals coherence across the three categories of GO terms: molecular function, cellular component and biological process. For instance, the TE+ cluster 5 has an over-represented receptor and transporter activities in the molecular function category that agree with the cell surface cellular component term and the response to stimulus biological process term. The over-represented catalytic activity molecular process term for the most TE-rich cluster 6 corresponds to a cytoplasmic cellular component term along with metabolic and biosynthetic biological process terms. In a general sense, the coherence of GO functional annotations within repeat-specific clusters and the differences between clusters are consistent with biological significance of the regulatory differences seen for these clusters.
Over-represented* GO slima terms for repeat-specific promoter clusters.
We have uncovered a connection between repetitive DNA sequences and nucleosome binding in human proximal promoter regions along with an influence of repetitive DNA promoter sequences on specific patterns of gene expression. Interestingly, different classes of repetitive elements function differently to mediate nucleosome binding; TEs bind nucleosomes tightly and are generally excluded from core promoter regions, while SSRs have a low affinity for nucleosomes and are enriched just upstream of TSSs. Thus, it appears that repetitive sequence elements are differentially utilized to tune the accessibility to promoter sequences by transcription factors, particularly the basal transcriptional machinery that assembles just upstream of the TSS, via changes in the local chromatin environment.