A subtelomeric region, 4q35.2, is implicated in facioscapulohumeral muscular dystrophy (FSHD), a dominant disease thought to involve local pathogenic changes in chromatin. FSHD patients have too few copies of a tandem 3.3-kb repeat (D4Z4) at 4q35.2. No phenotype is associated with having few copies of an almost identical repeat at 10q26.3. Standard expression analyses have not given definitive answers as to the genes involved. To investigate the pathogenic effects of short D4Z4 arrays on gene expression in the very gene-poor 4q35.2 and to find chromatin landmarks there for transcription control, unannotated genes and chromatin structure, we mapped DNaseI-hypersensitive (DH) sites in FSHD and control myoblasts. Using custom tiling arrays (DNase-chip), we found unexpectedly many DH sites in the two large gene deserts in this 4-Mb region. One site was seen preferentially in FSHD myoblasts. Several others were mapped >0.7 Mb from genes known to be active in the muscle lineage and were also observed in cultured fibroblasts, but not in lymphoid, myeloid or hepatic cells. Their selective occurrence in cells derived from mesoderm suggests functionality. Our findings indicate that the gene desert regions of 4q35.2 may have functional significance, possibly also to FSHD, despite their paucity of known genes.
One of the best cell culture models for mammalian differentiation is the induction of myotube formation from cultured myoblasts. However, only one high-resolution study of chromatin has been reported for human myoblasts, namely, analysis of histone H4 hyperacetylation on a single fetal myoblast cell strain using a commercial SNP tiling array (1). High-resolution analyses of other human cell types for chromatin epigenetics and annotation-neutral searches for transcripts are revealing evidence for many new differentiation-associated genes, alternative transcription start sites and transcription control elements within genes or located distant from them (2–6).
We have begun analysis of chromatin structure in myoblasts by focusing on 4q35.2, the subtelomeric chromosomal region that contains a muscular dystrophy-linked repeat array, D4Z4. The relationship of D4Z4 to pathogenic gene dysregulation in facioscapulohumeral muscular dystrophy (FSHD) is still enigmatic. FSHD is the only known disease caused by having too few copies of a long, tandemly repeated sequence (7). More than 95% of FSHD patients have only 1–10 copies of this 3.3-kb repeat unit on one allelic 4q35.2 (Figure 1A), while unaffected individuals have 11–100 copies on both 4q35.2 alleles. This progressive and painful disease is usually diagnosed in the teens. It is initially confined to certain groups of skeletal muscles and has no efficacious treatment. An extremely low copy number of 4q35 D4Z4 repeats (1–3) often correlates with an earlier onset and more severe disease but no FSHD-linked array has been found to have zero copies of the repeat unit (7).
Several findings implicate the involvement in FSHD of sequences in cis to a short D4Z4 array on 4q35.2. A short D4Z4 array by itself does not cause FSHD because almost identical arrays on 10q26.3 have no phenotypic effect even when they are equally short (7). This 4q-dependence of pathogenicity is found despite 98% homology between 4q and 10q within D4Z4 and >95% homology centromere proximally for 42 kb and distally for ~15–25 kb up to the telomere (Figure 1) (7,8). A very small number of single-base sequence variations distinguish canonical 4q35.2 D4Z4 repeat units from those of 10q26.3 (7). A short canonical 4q35-type D4Z4 array on 10q26.3 does not lead to FSHD while a short 10q26.3-type array on 4q35.2 does (9). Therefore, the chromosomal position of the 4q35.2 D4Z4 array links it to FSHD. Nonetheless, there is only controversial or incomplete evidence for functional linkage of FSHD to a 4q35-specific gene despite many expression microarray and real-time PCR (RT-PCR) studies (10–14).
We and others have focused attention on the proximal side of the D4Z4 array in subtelomeric 4q because no genes have been found in the short partially sequenced subregion distal to the array (7). Moreover, the short D4Z4-distal region shares high homology between 4q and 10q (15). Sequences at 4q35.2, which may be implicated in the disease, should be >78 kb proximal to D4Z4 because naturally occurring deletions of sequences within this ~78-kb proximal region (Figure 1) do not impact the phenotype of FSHD patients carrying a short D4Z4 array in cis (16). The furthest proximal sequence that is considered a disease candidate gene is SLC25A4 (ANT1), which is located on 4q35.1 ~5 Mb proximal to D4Z4. Disease-linked SLC25A4 overexpression was seen in some studies (17,18) but not others (11–13).
We proposed that there is a long-distance interaction between a short D4Z4 array and some unidentified proximal gene present on subtelomeric 4q, but not on 10q, that results in FSHD-specific transcription dysregulation (19,20). Long-range control of disease-related gene expression in humans can occur by looping interactions of ~1 Mb and even longer interactions may occur (21,22). The hypothesized long-range pathogenic interaction of short D4Z4 arrays and other 4q sequences in cis should involve a ‘molecular ruler’ that recognizes differences in sizes of D4Z4 arrays slightly above or below a near threshold of 33 kb (10 3.3-kb repeat units). Because it is the most likely region for the postulated cis-interactions, the 4 Mb of 4q35.2 with the addition of the SLC25A4 gene at 4q35.1 were the focus of the present study of chromatin in FSHD and control myoblasts. About 80% of 4q35.2 is devoid of known genes, including no reported micro RNA (miRNA) genes. Approximately 25% of the genome consists of gene-poor regions >500 kb termed gene deserts (23). Studying 4q35.2, which is mostly gene desert, should enhance our understanding of the underlying chromatin structure of the human genome as well as the molecular genetics of FSHD.
Using FSHD and control myoblast cell strains, we mapped DNaseI-hypersensitive (DH) sites at high resolution. DH sites are associated with nucleosome-free chromatin and various gene regulatory elements (24). Approximately 30% of them are in promoters of genes that are active or poised for activity (6,24). Crawford and colleagues (2,6,25) developed two specific high-throughput methods to identify large numbers of DH sites at once, using tiled arrays (DNase-chip) or high-throughput sequencing (DNase-seq). We analyzed myoblasts with DNase-chip, which employs custom tiling arrays and is ideally suited for analysis of targeted regions of the genome. Because DNase-chip and DNase-seq are highly correlated and display similarly high levels of sensitivity (92%) and specificity (94%) (6), we could also compare DNase-chip data from FSHD and control myoblasts with DNase-seq mapping of DH sites from other cell types.
Myoblast cell strains from FSHD patients (F1, 2 and 3; 28-year-old male, 18-year-old female and 14-year-old female, respectively) were derived from moderately affected deltoid or quadriceps biopsies. Their disease-linked D4Z4 arrays had three, three and two 3.3-kb repeat units, respectively. The control myoblasts (C1, 31-year-old female; C2 and C3, two different batches of myoblasts from a 27-year-old male) were from similar biopsies of unaffected individuals. These individuals were unrelated except for the 27-year-old unaffected male, who was the brother of patient F1. Duly signed patient consent forms were obtained that had been approved by the Institutional Review Boards of Tulane Health Science Center and the University of Mississippi Medical Center in Jackson. Myoblasts were propagated and checked by immunocytochemistry for desmin (20), a marker for muscle cells; >90% of the cells in the batches used for these experiments were desmin-positive.
Total RNA was extracted with TRIzol reagent (Invitrogen) and treated with DNaseI (Turbo DNA-free, Ambion). cDNA was synthesized (SuperscriptIII, Invitrogen) using random hexamer primers. Quantitative real-time polymerase chain reaction (qRT-PCR) was performed (SYBR Green Detection; iQ5, BioRad) with the following parameters: 95°C, 30 s; 63°C, 30 s; 72°C, 30 s for 45 cycles. For each sample analyzed, RT-minus controls were included. For each primer-pair (Supplementary Table S2), a standard curve with serial 10-fold dilutions of genomic DNA and the melting curve of the product were generated. The slopes of the standard curves were −3.3 ± 0.4 and the correlation coefficients were >0.98. The RNA level is represented as 1000 times the quantity (in arbitrary units) relative to that for human hypoxanthine phosphoribosyltransferase (HPRT).
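The standard-curve slope quoted above can be related to amplification efficiency through the commonly used relation E = 10^(−1/slope) − 1, under which a slope near −3.32 corresponds to a doubling of product per cycle. The following is a minimal sketch of that relation and of the 1000 × (target/HPRT) scaling; the function names and the example quantities are illustrative assumptions, not values or software from this study.

```python
def amplification_efficiency(slope: float) -> float:
    """Efficiency from the slope of Ct versus log10(template amount).

    A value of 1.0 means the product doubles every cycle.
    """
    return 10 ** (-1.0 / slope) - 1.0


def relative_rna_level(target_quantity: float, hprt_quantity: float) -> float:
    """RNA level expressed as 1000 x (target / HPRT), both in the same arbitrary units."""
    if hprt_quantity <= 0:
        raise ValueError("HPRT quantity must be positive")
    return 1000.0 * target_quantity / hprt_quantity


if __name__ == "__main__":
    # Slopes spanning the reported range of -3.3 +/- 0.4.
    for slope in (-3.7, -3.3, -2.9):
        print(f"slope {slope:+.1f}: efficiency {amplification_efficiency(slope):.2f}")
    # Hypothetical quantities read off a standard curve (arbitrary units).
    print(relative_rna_level(target_quantity=0.004, hprt_quantity=1.0))
```

Expressing every amplicon relative to HPRT in the same sample makes weakly expressed regions comparable across cell strains, provided the primer pairs have similar efficiencies.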
DNase-chip was performed as previously described (25). Briefly, myoblasts from normal control and FSHD patients were lysed with NP40 and nuclei were lightly digested with optimal concentrations of DNaseI (Roche). High-molecular weight DNase-treated DNA was prepared, and DNase-digested ends were repaired by T4 DNA polymerase (New England Biolabs). Biotinylated linkers were ligated to the DNase ends and the ligation product was sonicated to an average size of 300–700 bp. DNase ends were captured on streptavidin beads (Invitrogen), sheared ends were blunted with T4 DNA polymerase and repaired ends were ligated to a second set of linkers. DNase-enriched material was amplified by PCR, labeled and hybridized to custom tiling arrays (NimbleGen). These tiling arrays contained probes from across 4q35.2 [chromosome 4 (chr4): 187 300 001–191 273 063; all positions are relative to hg18, UCSC Genome Browser] and the ANT1 locus (chr4:186 291 392–186 315 418). Because there are many repetitive sequences within these regions, we included probe sets that overlapped moderately repeated sequences, including probe sets from the 3.3-kb D4Z4 repeat, but not highly repeated sequences (http://www.repeatmasker.org/). All probes (average ~30-nt overlap) were designed using an isothermal probe selection strategy, where probes were 45–75 nt in length and were size-adjusted to give a Tm of 76°C. Due to the extremely high GC content (73%) and repetitive nature of the D4Z4 region, two additional sets of isothermal probes were designed to have a Tm of 79 or 82°C. DNase-chip data were analyzed as previously described (25,26). Data were visualized in custom tracks of the UCSC Genome Browser (http://genome.ucsc.edu/) or the Integrated Genome Browser (http://www.affymetrix.com/).
The terminal 3 Mb of the q arm of chr4 in 4q35.2 has the lowest gene density of all the autosomal q arms (Supplementary Figure S1). Within 4q35.2, there are two central gene deserts occupying 3.1 Mb and separated by three genes, ZFP42, TRIML1 and TRIML2 (Figure 2). ZFP42 is a marker for pluripotent stem cells (27). TRIML1/Triml1, which encodes a RING finger protein, is expressed in preimplantation mouse embryos (28). Similarly, the related provisional gene TRIML2 seems to be very restricted in its expression (http://biogps.gnf.org). Consistent with previous findings about the regions of low gene density (29), the 4q35.2 gene desert regions consist predominantly of low (G + C) isochores [IsoFinder (30)] and have low concentrations of short interspersed repeats (SINEs; Supplementary Figure S2).
We mapped DH sites in vivo in three FSHD and three normal-control myoblast cell cultures at 4q35.2 by DNase-chip. The DH fraction (labeled with Cy5) and the randomly sheared DNA (labeled with Cy3) were cohybridized to custom tiling arrays. Our DH mapping results indicated that 4q35.2 in myoblasts has three chromatin domains differing in the density of DH sites. The most proximal (DHS domain 1, Figure 2) had the highest density of DH sites seen at 4q35.2, in accord with its higher gene content. However, DH sites in this domain extended far into the adjacent gene desert. The distal domain (DHS domain 3) contains several genes near its telomeric end. Only one of these, FRG1, is expressed at a substantial level in myoblasts or other tested cell types (Table 1). Surprisingly, the DH sites observed in all myoblast strains in this domain included sites that extended 0.9 Mb from FRG1 into the neighboring gene desert. DHS domain 2, which is located in the middle of 4q35.2, had a much lower density of DH sites, and none was detected in more than two of the six myoblast cultures. The only genes in this domain are ZFP42, TRIML1 and TRIML2, which are associated with stem cells or early embryogenesis, as mentioned earlier.
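The two-color design described above lends itself to a per-probe log2(Cy5/Cy3) ratio, with DH sites appearing as runs of adjacent probes enriched in the DNase fraction. The sketch below illustrates that idea only. It is a simplified stand-in, not the published DNase-chip analysis procedure (25,26), and the threshold, minimum run length and intensities are hypothetical.

```python
import math
from typing import List, Tuple


def log2_ratios(cy5: List[float], cy3: List[float]) -> List[float]:
    """Per-probe log2(Cy5 / Cy3); positive values mean enrichment in the DNase fraction."""
    return [math.log2(a / b) for a, b in zip(cy5, cy3)]


def call_enriched_runs(
    ratios: List[float], threshold: float, min_probes: int
) -> List[Tuple[int, int]]:
    """Return (start, end) probe-index pairs, end exclusive, for runs of at least
    min_probes consecutive probes whose log2 ratio exceeds threshold."""
    runs = []
    start = None
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_probes:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(ratios) - start >= min_probes:
        runs.append((start, len(ratios)))
    return runs


if __name__ == "__main__":
    # Hypothetical intensities for seven adjacent probes.
    cy5 = [100.0, 110.0, 400.0, 800.0, 900.0, 120.0, 95.0]
    cy3 = [100.0] * 7
    print(call_enriched_runs(log2_ratios(cy5, cy3), threshold=1.0, min_probes=3))
```

Requiring several consecutive enriched probes, rather than a single one, reduces the chance that one noisy probe is reported as a site.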
Three DH sites, DH8, -9 and -10, were located near the boundary of DHS domains 1 and 2 and were situated ~0.4–0.5 Mb proximal to ZFP42 and ~0.75–0.9 Mb distal to FAT1, the nearest gene known to be expressed in the muscle (Figure 2). These DH sites contain unique DNA sequences and were of similar intensity in all FSHD and control myoblast cell strains (Figures 2 and 3A). Recently, DH sites were identified across the whole genome from a number of human cell types by DNase-seq. This is a similar strategy to DNase-chip but uses next-generation sequencing (6). Even though DNase-seq and DNase-chip rely on different readout platforms, they have been shown to be highly correlated. The DNase-seq data have been generated as part of the ENCODE project and made available on the UCSC Genome Browser (http://genome.ucsc.edu/, Open Chromatin track). DNase-seq data from K562 (myeloid leukemia cell line), HepG2 (hepatocellular carcinoma cell line), GM12878 (lymphoblastoid cells), HeLa S3 cells and primary CD4+ T-cells revealed no appreciable DH peak at the genomic positions of DH8, 9 or 10; however, two skin fibroblast cell strains did exhibit these peaks [Figure 3A and data not shown; all but the CD4+ data (6) are previously unpublished]. In some cell types, ~4 kb distal to DH10, a DH site was observed that overlapped a CCCTC-binding factor (CTCF) binding site identified by chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-seq) or tiling array analysis (Figure 3A).
Five genes in 4q35.2 (DUX4, DUX4C, FRG1, FRG2 and TUBB4Q; Figure 4) and one at 4q35.1 (SLC25A4/ANT1) have been considered as candidates for the 4q-specific pathogenicity of short D4Z4 arrays (10,17,32,33,35,36). Of these FSHD candidate genes, only FRG1 has easily detectable expression at the RNA level (Table 1). Overexpression of FRG2, FRG1 and SLC25A4 RNA in FSHD versus control muscle was reported to be >60-, >25- and ~10-fold, respectively (10,17). In addition, FSHD-associated elevation of protein levels was detected for SLC25A4 (18). Among the FSHD candidate genes on 4q35, DH sites were detected only at the promoters of FRG1 and SLC25A4 and no significant differences in DH peak intensity were observed between FSHD and control myoblasts (Figure 4, inset and data not shown). However, the biological significance of DH signal intensity is unknown and different DH sites display a wide range of openness (6). Moreover, by qRT-PCR, there was not a significant association of the relative concentration of FRG1 or SLC25A4 RNA with disease status in myoblasts. The relative steady-state RNA levels for FSHD versus control myoblasts were 1.4 for FRG1 and 2.0 for ANT1 (P > 0.5; assays of three FSHD versus three control myoblast cell strains in duplicate).
The ability to detect DH sites at promoters of TUBB4Q, DUX4C, FRG2 and DUX4 in the terminal 0.25 Mb of 4q35.2 is complicated by this region containing many segmental duplications (Figure 4). With the exception of the 5′-end of FRG1 (37), analysis of all known 4q-terminal genes is difficult by almost any experimental method because of the near-absence of unique sequences. A further complication is that some of the sequences cross-hybridizing to this region of 4q35.2, especially within D4Z4 itself, are contained within incompletely sequenced regions of the genome, notably the short arms of the acrocentric chromosomes (38,39). Consequently, while there was sufficient probe coverage for TUBB4Q, DUX4C and FRG2 regions on the array, there was no coverage with unique probes.
Repeat-masked coverage for D4Z4, including its internal DUX4 gene, consisted of only scattered probes. DUX4 is the 1.6-kb homeobox-containing gene within each 3.3-kb D4Z4 repeat unit. Given the poor coverage with repeat-masking, we attempted to analyze the D4Z4 region by including probe sets that corresponded to these repetitive segments. DH sites in D4Z4 might be detected if all paralogs behaved similarly. However, the signal from D4Z4 probes did not identify any DH sites in this subregion (data not shown).
Among the six genes in 4q35.2 at the proximal end (Figure 2 and Table 1), DH sites in or near the promoter region were observed for FAT1, a cadherin gene involved in cell migration; CYP4V2, a cytochrome P450 family member and FAM149A, a RefSeq gene of unknown function encoding a hypothetical protein (Figure 3B, Supplementary Figures S3 and S4). Multiple DH sites in FAT1 introns were consistently seen in the myoblast cultures (Supplementary Figure S3). No FSHD-associated differences were observed at any of these DH sites. The adjacent gene MTNR1A, a melatonin receptor gene, had a highly reproducible DH site within the first intron, but not at the 5′-end (Supplementary Figure S3). Accordingly, MTNR1A RNA was not detectable in FSHD or control skeletal muscle, as determined on cDNA expression microarrays (11). No DH sites were observed at two other proximal 4q35.2 genes, KLKB1 and F11, which encode proteins involved in blood coagulation. One of the few described mRNAs from the terminal 1 Mb of 4q is AY956760 (590-kb proximal to the D4Z4 array), the reported product of the heat shock protein gene HSP90AA4P. No DH site was observed in its vicinity in myoblasts or the other investigated cell types (data not shown).
We next compared the positions of DH sites along the length of 4q35.2 between myoblasts and diverse cell types. Many reproducible cell type-specific differences were observed, including differences in DH sites in gene deserts (Figure 5). Importantly, myoblast-specific differences in DH sites at documented genes were seen most prominently within the large FAT1 gene (Figure 5 and data not shown), which is subject to complicated alternative splicing and implicated in modulating cell contacts (40) and repair of vascular smooth muscle injury (41). For replicates of a given cell type, the positions and relative signal intensities of DH sites were remarkably consistent, even when the replicates were from different individuals (as for myoblasts and fibroblasts). This indicates that genetic polymorphisms are not responsible for the observed cell type-specific differences. Myoblasts were most similar to skin fibroblasts in their distribution of DH sites along subtelomeric 4q, although these two cell types did display some differences (Figure 5). The muscle specificity of DH sites in 4q35.2 was confirmed in preliminary whole-genome DNase-seq analysis of control myoblasts (Crawford, G.E. and Ehrlich, M. et al., unpublished data).
In 4q35.2, there were 28 DH sites found in all six myoblast cell cultures (Supplementary Table S1). Five were located in the 5′-gene regions (within 2-kb upstream through the first intron; Supplementary Table S1, boldface). These 5′-region DH sites were more GC-rich than the overall human genome, as is often found for immediate 5′-regions. Most of the other DH sites did not share this property, but rather had GC percentages similar to that of bulk human DNA (42% G + C).
We looked for primary and secondary structure motifs that might be associated with the DH sites (including DH272) that mapped within the gene deserts of 4q35.2 in myoblasts (Supplementary Table S1). This set of sites was compared to four analogous sets of randomly chosen sequences of similar G + C content and length to the DH sites (35–48% G + C; 338–1642 bp). No significant differences were seen between DH sites and random control sequences in the frequencies of transcription factor consensus motifs, the possible 4–6 k-mers (DSGene, http://accelrys.com), the free energy of predicted secondary structures [Mfold web server, (42)] or the frequency of potential intramolecular G quadruplexes [GQRS Mapper, (43)].
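Two of the sequence properties compared above, G + C content and k-mer frequency, are simple to compute. The sketch below shows one way to do so for a set of sequences. It is an illustration written for this text and not the DSGene software used in the study; the function names are assumptions, and sequences are assumed to contain only A, C, G and T.

```python
from collections import Counter
from typing import Dict, Iterable


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq: str, k: int) -> Counter:
    """Counts of all overlapping k-mers in one sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequency(seqs: Iterable[str], k: int) -> Dict[str, float]:
    """Pooled relative frequency of each k-mer across a set of sequences."""
    total = Counter()
    for seq in seqs:
        total.update(kmer_counts(seq, k))
    n = sum(total.values())
    if n == 0:
        raise ValueError("no k-mers found; sequences shorter than k?")
    return {kmer: count / n for kmer, count in total.items()}


if __name__ == "__main__":
    # Toy sequences standing in for a DH-site set; not real data.
    toy_set = ["ACGTACGTGG", "TTGACGTACA"]
    print([round(gc_percent(s), 1) for s in toy_set])
    print(kmer_frequency(toy_set, 4))
```

Running the same functions on the DH-site set and on each matched random set gives frequency tables that can then be compared k-mer by k-mer.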
One striking characteristic of DH sites in 4q35.2 was that ~20% of those observed in all six myoblast cultures (Supplementary Table S1) overlapped a simple tandem repeat [STR, (44)]. BLAST searches indicated, with one exception, that the sequences of these DH-STR sites were located at a single chromosomal position, as tandem repeats in the reference genome. The exception, DH15, was located 15-kb proximal to D4Z4 (Figure 2) and had 84–97% identity to sequences on chromosomes 3, 7 and 17. Four out of five of the most distal DH sites (80%) observed in all six myoblast cultures overlapped an STR while only one out of the remaining 23 DH sites (4%) showed such overlap (Figure 2, grey circles). The STRs overlapping DH sites in 4q35.2 had consensus lengths of 27–142 nt and were tandemly repeated 3–59 times (Supplementary Table S1). They are unlikely to be artifacts because all DH sites in DNase-chip were identified by determining enrichment of DNase-treated versus randomly sheared material from the same individual (25). DH15, located 15-kb proximal to D4Z4, was the only DH site overlapping an STR that was also detected in whole genome DNase-seq data from other cell types (data not shown). However, such STR-containing DH sites may be missed because DNase-seq experiments filter sequence tags that map to more than four places in the reference genome (hg18). Interestingly, the repeat overlapping DH15 has only three tandem copies in the reference genome.
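Deciding whether a DH site overlaps an STR is an interval-overlap test on genomic coordinates, and the proportions discussed above follow from the counts given in the text (4 of 5 distal sites, 1 of the remaining 23, hence 5 of 28 overall). The sketch below shows both steps. The coordinates in the usage lines are hypothetical, and half-open intervals are an assumption of this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open: start included, end excluded


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(sites: List[Interval], repeats: List[Interval]) -> float:
    """Fraction of sites that overlap at least one repeat interval."""
    if not sites:
        raise ValueError("no sites given")
    hits = sum(1 for site in sites if any(overlaps(site, rep) for rep in repeats))
    return hits / len(sites)


def percent(count: int, total: int) -> float:
    """Simple percentage helper."""
    return 100.0 * count / total


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    dh_sites = [(1000, 1400), (5000, 5600)]
    strs = [(1350, 1500)]
    print(fraction_overlapping(dh_sites, strs))
    # Counts taken from the text above.
    for count, total in ((4, 5), (1, 23), (5, 28)):
        print(f"{count}/{total} = {percent(count, total):.1f}%")
```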
At the 1-Mb terminus of 4q, there were seven DH sites observed in three to five of the six myoblast cultures, rather than in all six of them (Figure 4 and data not shown). Only one of these did not overlap an STR, namely, DH272. It is located 272-kb proximal to D4Z4, outside the region of high homology between subtelomeric 4q and 10q. It was also beyond the region shown to be deleted in rare FSHD families with no effect on the phenotype [Figures 1 and 4, (16)].
DH272 was observed preferentially in FSHD versus control myoblast cell strains (Figure 4). In other cell types examined by DNase-seq, this DH site was usually found at least as a small peak (Figure 4 and data not shown). Data on CTCF binding to this region were available from ChIP followed by next-generation sequencing for K562, HepG2 and GM12878 lymphoblasts (V. Iyer and B.-K. Lee, recently released data) or by tiling array analysis for lung fibroblasts (4). CTCF binding overlapped DH272 (as well as DHFRG1) in all of these cell types (Figure 4 and data not shown). Myoblasts have not yet been analyzed for CTCF binding in this region but most CTCF binding sites seem to be invariant among different human cell types (4). Sequence conservation among vertebrates was observed in the DH272 peak, including at the consensus sequence CTCF site, and was greater than that at DH8, 9 or 10 (Supplementary Figures S2 and S5; Figure 3A; and data not shown). In a preliminary comparative genome hybridization, using high-density tiling arrays to compare two FSHD samples with one control sample, we found no copy number variations in the region of DH272.
Because DH sites sometimes mark promoters of unannotated transcripts, we searched for evidence of transcription surrounding DH272, DH8, DH9 and DH10. Amplicons for qRT-PCR were chosen on the basis of sequence conservation, proximity to the DH site and locations of predicted genes (Supplementary Figure S5). We compared different unique amplicons with similar PCR efficiency (as determined on genomic DNA) by qRT-PCR using cDNA synthesized by random-hexamer priming. Several subregions in the vicinity of DH272 had significantly higher levels of RNA than others (Figure 6A). The more highly expressed amplicons were located 3.0- or 17-kb distal or 2.0-kb proximal to the center of DH272. This interval spans 19 kb. It is interrupted by amplicons that gave significantly less RT-PCR product (Figure 6A), possibly due to posttranscriptional processing or the use of several transcription units. The RNA levels for these three amplicons were ~200- to 500-fold lower than that for the HPRT standard, a moderately transcribed gene. Therefore, these RNA levels were low but within the range of weakly expressed, but well-documented, genes (45).
In control fibroblast cell strains and lymphoblastoid cell lines (two each), transcripts from the three most highly expressed amplicons in the vicinity of DH272 were also significantly (P < 0.05) more abundant than those of neighboring amplicons (data not shown). No significant tissue-specific differences were seen among myoblasts, fibroblasts and lymphoblasts. No FSHD-specific differences were observed when comparing RNA from four FSHD patients and three normal controls (data not shown).
FSHD is associated with short arrays of the macrosatellite D4Z4 at subtelomeric 4q but not at subtelomeric 10q. It is still uncertain why, despite the near identity of 4q and 10q D4Z4 and much homology proximally and distally, FSHD is a 4q-specific disease. This dominant disease is caused by the reduction in size of a 4q D4Z4 array past a near-threshold of ~36 kb (Figure 1). For example, contraction of a 40-kb array (with 12 3.3-kb repeat units) to one of 30 kb (with 9 3.3-kb repeat units) can result in the disease. We proposed that FSHD involves pathogenic long-range looping in cis of the centromere-proximal end of D4Z4 chromatin with 4q-specific sequences at 4q35.2 that is enabled by changes in intra-array chromatin looping dependent on the array size (20,46). The importance of pathogenic chromatin structure changes to this disease is indicated by recent evidence for FSHD-specific chromatin alterations in the D4Z4 array itself in FSHD patients' cells (47,48). In addition, the most proximal D4Z4 repeat unit apparently has a more open structure than the bulk of the array (20,48,49). Many experimental studies of the molecular genetics of FSHD do not duplicate the unusual chromatin environment of 4q35.2, which is likely to be critical for this disease in view of its 4q specificity. We used DNase-chip to examine 4q35.2 for chromatin features suggestive of a distinctive higher order structure. Given the lack of definitive findings about cis effects of short D4Z4 arrays at 4q35.2 on gene expression (10–14,34,35), DNase-chip also served as an annotation-neutral method of finding evidence for undocumented genes that may be important to FSHD in this gene-sparse 4-Mb region.
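The array sizes in the example above follow from multiplying the number of repeat units by 3.3 kb. With the copy-number ranges given earlier in this article (1–10 units on the disease-linked allele, 11–100 units in unaffected individuals), the largest disease-linked array is 33 kb and the smallest array in the unaffected range is about 36 kb. The sketch below only reproduces this arithmetic; the function names are illustrative and the range check is not a diagnostic criterion.

```python
REPEAT_UNIT_KB = 3.3  # length of one D4Z4 repeat unit, as stated in the text


def array_length_kb(repeat_units: int) -> float:
    """Approximate D4Z4 array length in kb for a given number of repeat units."""
    if repeat_units < 0:
        raise ValueError("repeat_units must be non-negative")
    return repeat_units * REPEAT_UNIT_KB


def in_disease_linked_range(repeat_units: int) -> bool:
    """True for 1-10 repeat units, the range reported for >95% of FSHD patients."""
    return 1 <= repeat_units <= 10


if __name__ == "__main__":
    for units in (9, 10, 11, 12):
        print(
            f"{units} units: {array_length_kb(units):.1f} kb, "
            f"disease-linked range: {in_disease_linked_range(units)}"
        )
```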
At 4q35.2, we found 28 DH sites detectable in all six examined myoblast cultures from FSHD patients or normal controls. As expected, most were located in the proximal 1 Mb of 4q35.2, the most gene-rich subregion. Surprisingly, within the bifurcated 3.1-Mb gene desert at 4q35.2 (Figure 2), 12 DH sites were observed in all tested myoblast cultures >100 kb from the nearest gene. For some of these DH sites, notably DH8, DH9 and DH10, the distances to the closest genes active in myoblasts were very large, >0.7 Mb. Nonetheless, these sites may identify long-distance enhancers, silencers or locus control regions (50–53). Alternatively, they might be associated with unannotated genes or structural elements, such as looping hubs (54). That DH8, 9 and 10 were observed in myoblasts and fibroblasts, both of mesodermal origin, but not in cells of the lymphoid, myeloid and hepatic lineages, suggests functionality.
In the D4Z4-proximal 1-Mb region, which is mostly gene desert, we found nine DH sites present in at least three of the six myoblast cell cultures. Only two of these, namely, DHFRG1 (in the promoter of FRG1) and DH272 (in the distal gene desert) did not overlap a DNA repeat. The other seven overlapped tandem repeats of short units (STRs). By DNase-chip analysis, we also observed that DH sites frequently overlap STRs in the terminal 1 Mb of 10q (unpublished data). While the biological significance of these DH-STRs remains to be determined, there are precedents for shorter tandem repeats influencing nucleosome positioning and excluding nucleosomes (55,56). With respect to DHFRG1, the DH site at the FRG1 promoter, one group reported overexpression of FRG1 RNA in FSHD muscle (17) but several others were unable to confirm this (11–13). In this study, we found no difference between control and FSHD myoblasts in this DH site and no significant difference in the amount of RNA product. DH272, the unique DH site located 150 kb proximal to FRG1, was observed preferentially in FSHD versus control myoblast cultures. Preliminary results from DNase-seq on three other control myoblast cell strains also revealed little or no DH peak at the position of DH272. We found nearby unannotated transcripts (probably non-coding RNAs, Supplementary Figure S5) that were not FSHD-specific in myoblasts. However, further study is needed of both myotubes and myoblasts to test the possibility of disease-linked expression of amplicons in the vicinity of DH272 and other DH sites in the 4q35.2 gene desert.
Even DH sites in 4q35.2 that did not display FSHD-related differences might be involved in pathogenic chromatin looping interactions. DH sites could be identical in both normal and disease cells, but the 3D structure (looping) and protein complexes that bind to them could differ between them. Given that DH sites can be associated with loci at which chromatin looping occurs (54), our results suggest subregions of 4q35.2 with the potential for these chromatin interactions that should be investigated. Our study points to the DH272 region as particularly attractive for searching for FSHD-related sequences because of the overlap of DH272 with a CTCF sequence found in many cell types [Figure 4 and (4)]. Moreover, the potential CTCF binding site identified in this ChIP-positive region of lung fibroblasts by Kim et al. (4) matched the CTCF consensus sequence at 19 out of 20 nt. CTCF is a sequence-specific DNA-binding protein with diverse functions, including as an insulator and organizer of chromatin looping (54,57). CTCF might play a role in our proposed pathogenic looping of 4q35.2-specific sequences to a short pathogenic D4Z4 array because D4Z4 was recently shown to have a CTCF-binding sequence (47). Some evidence was presented for increased binding of CTCF to D4Z4 in FSHD versus control myoblasts (47).
In addition to revealing candidate DNA sequences for FSHD involvement in gene deserts, our data indicate that FAT1 transcription warrants further study of possible differences in FSHD and control muscle cells beyond the few studies involving expression microarrays (11,13). FAT1 is the only annotated 4q35.2 gene with evidence for complex tissue-specific expression and, in this study, a muscle-specific pattern of DH sites. Many myoblast-specific DH sites were found in and around this large gene in both FSHD and control cell strains, suggesting that this subregion contains active regulatory elements associated with the muscle lineage. The cell type-specific differences in chromatin that we observed are consistent with tissue-specific production of multiple FAT1 RNA and protein isoforms from predicted gene-internal promoters and by alternative splicing (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). FAT1, which contains >25 exons, encodes a cadherin-type integral membrane protein which is implicated in diverse developmental and signaling pathways, including in vascular smooth muscle remodeling (58).
It has been proposed that a short disease-linked D4Z4 array at 4q35.2, but not at 10q26.3, triggers abnormal transcription of DNA sequences within the array itself in affected FSHD muscle cells (34,35,39,59). Overexpression of DUX4 RNA, derived from the 1.6-kb gene inside each 3.3-kb D4Z4 repeat unit, was reported in FSHD myotubes relative to control myotubes (35) but truncated transcripts or transcripts from other portions of the D4Z4 repeat unit are more prevalent than full-length DUX4 transcripts (34). Currently, definitive conclusions as to the relationship of D4Z4 transcription and pathogenicity are precluded by low expression levels, small numbers of samples, many cross-hybridizing sequences and the variety of small transcripts (34). If dysregulated expression of some D4Z4 sequence from short arrays initiates abnormal gene expression in FSHD, it remains to be explained why it is only short 4q arrays that cause the disease despite the ~98% identity between 4q and 10q D4Z4 (8) and homology outside the arrays (Figure 1). In addition, exchanges between the almost (but not completely) identical 4q and 10q D4Z4 arrays are rather frequent and can result in an array with 4q-type repeat units replacing all the 10q units (60). Nonetheless, short D4Z4 arrays cause disease only when they reside on 4q (61). Therefore, polymorphisms that were found to be associated with canonical 4q-type D4Z4 units, but not canonical 10q-type D4Z4 units (8), are unlikely to explain the 4q linkage of FSHD.
We propose that the chromosomal environment of 4q35.2 plays a key role in the 4q-specific nature of FSHD, whether abnormal expression from 4q containing a short D4Z4 array initiates from within or outside D4Z4. Both at the DNA and the chromosome levels, 4q35.2 is unusual. It has the lowest gene density in its terminal 3 Mb of any of the q arms. It is distal to a large bifurcated gene desert punctuated centrally by a few genes that appear to be critical in early embryogenesis. Like some other genes (62), especially those important in the control of development (63), these inter-desert genes may be flanked by gene deserts to help keep their expression tightly restricted to certain stages in development. They might be part of large blocks of chromatin with distinguishing epigenetic features. In CD4+ cells, this gene desert region has histone modifications [(5) and http://genome.ucsc.edu] indicative of inactive euchromatin rather than constitutive heterochromatin. This is consistent with our previous immunocytochemical and DNA replication analyses of FSHD and control myoblasts (64). However, given the complexity of epigenetic modification of chromatin, there can be a variety of types of large distinctive chromatin blocks within euchromatin (65).
One of the properties that distinguishes subtelomeric 4q (which can have pathogenic D4Z4 arrays) and 10q (whose D4Z4 arrays are always phenotypically neutral) is that only the 4q subtelomere (and not 10q or 4p) has a strong association with the nuclear rim in FSHD and control myoblasts and myotubes (66). A marker that was 0.22 Mb from D4Z4 on 4q35.2 (close to DH272) showed a significantly closer association with the nuclear periphery than did D4Z4. The unusual localization of subtelomeric 4q to the nuclear periphery might be necessary for pathogenicity. This localization may result partly from its uncommonly large region of inactive euchromatin (67,68) in a distinctive conformation, as reflected in its low concentration of DH sites. Our results emphasize the underappreciated importance of considering the regional chromatin context of D4Z4 in analysis of the mechanism by which contraction of D4Z4 to a size of <36 kb can lead to disease (69).
Supplementary Data are available at NAR Online.
National Institutes of Health (NS04885 to M.E., HG003169 to G.E.C.); the FSH Society (to M.E.); Fields Center for FSHD and Neuromuscular Research (R.T. and J.S.). Funding for open access charge: The National Institutes of Health [NS04885 to M.E.].
Conflict of interest statement. None declared.
We are grateful to Dr V. Vedanarayanan for several of the muscle samples from which myoblast cell strains were generated and to Dr Vishy Iyer and Bum-Kyu Lee, who generated the ENCODE CTCF ChIP-seq data.