X-chromosome inactivation provides a singular opportunity to investigate the potential relationship between sequence elements and the structural and functional transformation of essentially a whole chromosome. Since at least hundreds or thousands of XIST transcripts bind across the chromosome, the sequence motifs involved would likely be highly represented throughout the genome, potentially making it difficult to discriminate them from putative “junk”. It was long ago suggested that repetitive sequences may be involved in promoting chromosome inactivation (Gartler and Riggs, 1983), with LINE elements a suspect, as noted above (for review, see Lyon, 2003). Using bioinformatic sequence analysis, Bailey et al. reported that the human X-chromosome has a ~1.7-fold higher level of L1 LINE elements than the autosomal average, except in the region at Xp22 that escapes chromosome inactivation (Bailey et al., 2000). While this correlation is consistent with an involvement of L1 elements in promoting X-inactivation, L1 may also have accumulated on the X due to its lower meiotic recombination in all but the pseudoautosomal region (which also lacks L1 enrichment). Other studies of canonical repeats concluded that L1 elements are either not likely involved (Chureau et al., 2002; Ke and Collins, 2003) or may not be solely responsible (Ross et al., 2005).
Rather than focus on candidate elements, we undertook an open, unbiased bioinformatic search for any motifs that are abundant, widely distributed, and specifically enriched on the X chromosome. Analyses were performed with and without masking for known interspersed repeat families (e.g., LINEs, SINEs, and LTRs), since these copious elements may well contribute but would also likely obscure other repeated motifs. Using a linguistic approach, we examined the number and distribution of all nine-base-pair words in the genomic sequence of each human chromosome. In addition, we treated the X chromosome as comprising two distinct segments: XE, a ~10 Mb region at Xp22.3 that includes the pseudoautosomal region and more fully escapes X inactivation (Carrel and Willard, 2005), and XS, the remainder of the chromosome, which is largely silenced on Xi. While some genes scattered throughout XS partially escape silencing in some cell types, XE is a large, unique chromosomal domain that is wholly resistant to X-inactivation, unlike autosomal chromatin, which has substantial capacity for inactivation.
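The word-counting idea described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study; it assumes sequences are plain strings in which repeat-masked bases are lowercase (soft-masked), counts words on one strand only, and uses hypothetical function names (`count_words`, `fold_enrichment`) chosen for illustration.

```python
from collections import Counter


def count_words(seq, k=9, skip_masked=False):
    """Count overlapping k-base words in seq.

    Assumption: interspersed repeats are soft-masked (lowercase).
    With skip_masked=True, any word touching a masked base is ignored.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if skip_masked and not word.isupper():
            continue
        word = word.upper()
        # Ignore words containing ambiguous bases such as N.
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def fold_enrichment(target_counts, target_len, background_counts, background_len, word):
    """Per-base frequency of word in the target divided by that in the background."""
    target_rate = target_counts[word] / target_len
    background_rate = background_counts[word] / background_len
    if background_rate == 0:
        return float("inf") if target_rate > 0 else float("nan")
    return target_rate / background_rate
```

In this sketch, a word's enrichment on one segment (for example XE) relative to another (for example the autosomes) is simply the ratio of its per-base frequencies; running `count_words` once with and once without `skip_masked` mirrors the masked and unmasked analyses.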
Figure: L1 LINE density versus gene density across all human chromosomes and the XE escape domain.
Figure: The pseudoautosomal region, which fully escapes inactivation on the Xi, exhibits a striking 11-fold enrichment in the GATA repeat sequence.
This analysis revealed several substantial new features of X chromosome sequence content. First, we confirmed that L1 is enriched on XS, and extended this to show that the enrichment on X is distinct from all individual autosomes (not just the autosomal average). This was important since individual chromosomes can vary substantially; for example, gene-rich Chr 19 is especially depleted in L1 elements, in contrast to Chr 4. Notably, the rest of our findings identified differences in simple sequence repeats, which are typically excluded from such analyses. The dinucleotide repeats [AT]n, [AC]n, and [AG]n are significantly enriched across the X chromosome compared to autosomes. Intriguingly, these repeats are able to form unusual DNA structures, which could potentially contribute to the regulation of facultative heterochromatin.
Most importantly, this analysis uncovered a dramatic difference in the content of small simple repeats scattered throughout the whole region. A striking enrichment (> 10-fold) of (GATA)n repeats distinctly marks the 10 Mb segment at Xp22 that escapes inactivation; this is confirmed by FISH with an oligo GATA probe and, importantly, is also seen in other eutherians (McNeil et al., 2006). These findings suggested a new paradigm whereby a regional escape from X-inactivation in a large chromosomal domain may be due to the presence of elements that prevent heterochromatinization, rather than simply a lack of elements that promote it. The GATA repeats are clearly a marked and conserved feature dispersed at many sites throughout the “fabric” of this large chromosomal segment; in fact, further analysis showed that no other 10 Mb chromosomal segment in the genome showed such a striking enrichment for any 9-mer word (McNeil and Lawrence, unpublished). This strongly suggests that the GATA repeats are involved in the unique biology of this region, in either escape from silencing or potentially the obligatory meiotic recombination of this region in the XY body, or both (McNeil et al., 2006). Recent literature provides other examples in which gene regulation appears to be coordinated across a chromosomal domain, such as in hESC (Li et al., 2006), or silencing of a tumor suppressor gene in a band-sized chromatin domain (Frigola et al., 2006). We suggest that the broader sequence context of a chromosomal region may increasingly prove important in gene regulation, and the chromosomal domain may in part be defined by the repeated motifs or “words” that populate it.
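To make the notion of regional repeat enrichment concrete, the following Python sketch scans a sequence in fixed windows and reports each window's tandem-repeat coverage relative to the sequence-wide average. It is an illustration under stated assumptions rather than the analysis reported here: the function names, the minimum of two tandem units, the use of non-overlapping windows, and the single-strand search are all choices made for this example.

```python
import re


def tandem_repeat_bases(seq, unit="GATA", min_units=2):
    """Total bases covered by runs of at least min_units tandem copies of unit."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_units))
    return sum(m.end() - m.start() for m in pattern.finditer(seq.upper()))


def window_enrichment(seq, window, unit="GATA", min_units=2):
    """Fold enrichment of repeat coverage in each non-overlapping window.

    Each window's covered fraction is divided by the covered fraction of the
    whole sequence. Runs that straddle a window boundary are split, so this
    slightly undercounts near window edges.
    """
    overall = tandem_repeat_bases(seq, unit, min_units) / len(seq)
    folds = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        density = tandem_repeat_bases(chunk, unit, min_units) / len(chunk)
        folds.append(density / overall if overall > 0 else float("nan"))
    return folds
```

Applied to a chromosome with a window of 10 Mb, a segment such as the one at Xp22 would be expected to stand out as a single window with a fold value well above the rest; repeating the scan for every 9-mer word would correspond to the genome-wide comparison mentioned above.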
Although (GATA)n repeats clearly mark this unique chromosomal domain that more fully escapes silencing, GATA enrichment is not seen for the individual genes that partially or more variably escape inactivation throughout the rest of XS (McNeil et al., 2006). These may be regulated by a distinct mechanism, since in mouse the Jarid1C gene and two other individual genes that escape silencing were flanked by CTCF binding sites (Filippova et al., 2005), and Li and Carrel (2008) further showed this was an intrinsic property of the Jarid1C locus. Carrel et al. (2006) and Wang et al. (2006) each reported that computational profiling could recognize motifs that predict genes that escape silencing, even outside the XE region. While no discrete consensus motif was identified, this provides further evidence that XIST RNA is not indifferent to chromosomal sequence context.