Transposable element (TE) derived sequences comprise half of our genome and DNA methylome, and are presumed densely methylated and inactive. Examination of the genome-wide DNA methylation status within 928 TE subfamilies in human embryonic and adult tissues revealed unexpected tissue-specific and subfamily-specific hypomethylation signatures. Genes proximal to tissue-specific hypomethylated TE sequences were enriched for functions important for the tissue type and their expression correlated strongly with hypomethylation of the TEs. When hypomethylated, these TE sequences gained tissue-specific enhancer marks including H3K4me1 and occupancy by p300, and a majority exhibited enhancer activity in reporter gene assays. Many such TEs also harbored binding sites for transcription factors that are important for tissue-specific functions and exhibited evidence for evolutionary selection. These data suggest that sequences derived from TEs may be responsible for wiring tissue type-specific regulatory networks, and have acquired tissue-specific epigenetic regulation.
A large portion of eukaryotic genomes is derived from transposable elements (TEs)1. TEs have been described as parasitic or junk DNA. However, there is mounting evidence for their evolutionary contribution to the wiring of gene regulatory networks2-7, a theory rooted in Barbara McClintock’s discovery that TEs can control gene expression3,8,9. TEs contain functional binding sites for transcription factors6,10,11; TE DNAs are presumed to be methylated in somatic cells to suppress transposition and TE-mediated changes in gene expression12-14. However, the extent to which DNA methylation silences TEs and how DNA methylation-mediated silencing of TEs is reconciled with the known regulatory function of TE sequences remain unexplored.
To construct TE DNA methylation profiles we assayed 29 human samples representing 11 cell types using two complementary DNA methylomics methods: MeDIP-seq and MRE-seq15,16. Tissue and cell types included embryonic stem cells (ESC H1); fetal brain tissue and primary neural progenitor cells (derived from cortex or ganglionic eminence regions); primary adult breast epithelial cells (luminal epithelial cells, myoepithelial cells, and a progenitor cell-enriched population); unfractionated peripheral blood mononuclear cells (PBMC), and adult immune cells including CD4+ naïve, CD4+ memory, and CD8+ naïve cells.
Mapping short-read data to TEs is difficult due to the high copy number of these elements. Standard mapping methods often discard or misalign high-quality reads derived from TEs (Supplementary Note). We developed a computational strategy termed Repeat Analysis Pipeline (RAP) that allows mapping of reads derived from repetitive elements to one of 1,395 specific families of human repeats, including 928 TE families (Supplementary Fig. 1-5, Note). RAP includes features of three previously published methods17-20 combined with novel technical modifications (Methods).
As expected, sequences of the majority of TE families were methylated in all samples examined. The total MeDIP-seq signal, which represents the proportion of individual TE families that are methylated, correlated tightly with the total number of CpGs in that TE family, consistent with the high level of DNA methylation in TEs (R²=0.95, Supplementary Fig. 6-9). In contrast to TE families, total MeDIP-seq signal was 4.9% in promoter CpG islands after normalizing for CpG content, consistent with the unmethylated status of promoter CpG islands. Conversely, MRE-seq signal, which measures unmethylated DNA, was 6.7-fold more enriched over promoter CpG islands than in TEs (Supplementary Fig. 6-9).
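The correlation reported above can be illustrated with a minimal sketch that computes the squared Pearson correlation between per-family CpG counts and total MeDIP-seq signal. The function name `r_squared` and the toy values are hypothetical and are not taken from the study data; the exact regression procedure used by the authors is described in their supplementary material, not here.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy values: CpG count and total MeDIP-seq signal per TE family.
cpg_counts = [100, 200, 300, 400]
medip_signal = [12, 19, 33, 41]
fit = r_squared(cpg_counts, medip_signal)
```

A value of `fit` close to 1 would indicate that signal scales almost linearly with CpG content, which is the pattern expected for densely methylated families.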
Strikingly, we found sequences of numerous TE families that were differentially methylated in specific cell-types. Unsupervised clustering of samples based on TE methylation revealed a clear relationship among tissue-types, indicating that TE methylation is a signature that can distinguish tissue- or possibly cell-types (Fig. 1a, b). We identified 14 TE families with significant (p<0.05, ANOVA) hypomethylation patterns in brain samples, 55 in breast samples, 13 in blood samples, and 13 in ESC (total 95 TE families, p<0.05, ANOVA). More than 800 other families were consistently methylated across cell types from these 29 samples (Supplementary Note). Most tissue-specific hypomethylated TEs belonged to the ERV/LTR class (69/95), whereas 12 were DNA transposon families (Supplementary Table 1). These findings are consistent with previous studies that have shown that LTR-elements participate in regulation of mammalian genes3,21-24, and support the hypothesis that LTRs might play a role in the epigenetic regulation of cell-type specific gene expression. For each TE family, we identified individual copies that were uniquely mappable and were tissue-specifically hypomethylated. The complete list of TE families and coordinates of individual elements are provided at our website (Supplementary Note).
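As an illustration of the kind of test named above, the following sketch computes a one-way ANOVA F statistic for a single TE family whose methylation signal has been measured in samples grouped by tissue. This is a simplified, assumed formulation: the function name `one_way_anova_f` and the numbers are hypothetical, and the normalization and exact test settings used in the study are given in its Supplementary Note. Converting F to a p-value requires the F distribution with (k−1, n−k) degrees of freedom, which is left out here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n <= k:
        raise ValueError("need >= 2 non-empty groups and more samples than groups")
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of samples around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical methylation scores for one TE family in three tissue groups.
tissue_a = [1.0, 2.0, 3.0]
tissue_b = [2.0, 3.0, 4.0]
tissue_c = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([tissue_a, tissue_b, tissue_c])
```

A large F statistic indicates that between-tissue differences dominate within-tissue variability, which is the signature of a tissue-specific methylation pattern.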
We next investigated the genomic distribution of members of TE families showing tissue-specific hypomethylation. Their proximity to “known genes” did not differ from that expected by chance (Supplementary Fig. 10). However, genes near members of these TE families were significantly enriched for functions specific to the tissue type in which they were hypomethylated (Table 1 and Supplementary Table 2). For example, hypomethylation of the UCON29 DNA transposon was restricted to fetal brain, and 11 of the 60 genes with a nearby UCON29 element are involved in neuron development (p<6.6×10⁻²³, binomial test). Another brain-specific hypomethylated retroelement, LFSINE, was located near 19 out of 87 genes involved in telencephalon development (p<1.5×10⁻⁵, binomial test). Similarly, genes associated with LTR12 and LTR77, two ERVs hypomethylated in immune cells, were enriched for immune-related functions, including ‘antigen processing and presentation of peptide or polysaccharide antigen via MHC class II’ (p<7.4×10⁻⁶, binomial test), and ‘oxidation reduction’ (p<3.7×10⁻⁶, binomial test). While antigen processing and presentation is a known function of lymphocytes and other antigen-presenting hematopoietic cells, the enrichment of genes in the oxidation-reduction process was interesting because T-cell activation, differentiation and proliferation are sensitive to the redox potential25,26.
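A one-sided binomial enrichment test of the kind cited above can be sketched as follows. The sketch asks how likely it is to see at least k annotated genes among n TE-proximal genes if each gene carried the annotation with background probability p. The function name `binomial_upper_tail` and the background fraction of 0.02 are assumptions for illustration only; the background model actually used in the study is not specified in this section, so the sketch is not expected to reproduce the reported p-values.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0 <= k <= n:
        raise ValueError("k must lie in [0, n]")
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 11 of 60 genes near UCON29 carry the annotation (counts from the text);
# the background fraction below is a hypothetical placeholder.
assumed_background = 0.02
p_value = binomial_upper_tail(11, 60, assumed_background)
```

Direct summation is adequate for gene sets of this size; for very large n a library routine working in log space would be a safer choice.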
DNA hypomethylation has been associated with distal regulatory regions27. We next asked if TE sequences with tissue-specific DNA hypomethylation possessed other tissue-specific epigenetic signatures. We generated histone modification data (H3K4me1, H3K4me3, H3K27me3, H3K36me3 and H3K9me3) from these same tissues, and collected p300 genome-wide locations from related tissues28 (Fig. 2). Sequences within hypomethylated TE families displayed remarkably strong tissue-specific H3K4me1 signals. For example, LTR77, a TE of the ERV class, had the lowest methylated (MeDIP-seq) signal and the highest unmethylated (MRE-seq) signal in blood (Fig. 2a). When we applied RAP to H3K4me3 and H3K4me1 ChIP-seq data from the same samples, we found much stronger signals within the LTR77 family in T cells compared to the three other cell and tissue types (Supplementary Fig. 11). Using data from CD8+ naïve cells, we identified a “histone signature” for all 148 LTR77 copies along with a 3kb region flanking the LTR (Fig. 2b,c). We observed a strong H3K4me1 peak over the LTR element itself, suggesting that at least some LTR77 elements had this enhancer mark. The H3K4me3 peak detected 3kb downstream suggested nearby promoter activities, potentially from genes regulated by enhancers embedded in LTR77. LFSINE and UCON29 displayed H3K4me1 enrichment specifically in fetal brain (Fig. 2f,g, and Supplementary Fig. 12). Moreover, LFSINE and UCON29 both accumulate p300 binding signals in the neuroblastoma cell-line SK-N-SH, but not in any non-neural cell lines including ESC, HepG2, or GM12878 (Fig. 2h, Supplementary Fig. 12). Similarly, the T cell-specific hypomethylated TE LTR77 accumulated p300 binding signal in GM12878 (a lymphoblastoid cell-line), but not in any other cell type (Fig. 2d). These results suggested that hypomethylated DNA sequences derived from TEs might serve as tissue-specific enhancers.
We next asked if any of these hypomethylated, enhancer-like sequences within TEs might contribute to tissue-specific gene expression. We selected candidate TEs that could be uniquely mapped using our data. As a proof of principle, we focused on two putative target genes: ERAP1, a gene involved in the generation of most HLA class I-binding peptides, and the glial cell line-derived neurotrophic factor (GDNF) family receptor alpha-1 (GFRA1), a neurotrophic factor receptor involved in the control of neuron survival and differentiation29 (Fig. 3a,d). An LTR77 element was detected 2kb upstream of an ERAP1 alternative transcription start site. Our genome-wide data suggested that this element was hypomethylated in T-cells, a prediction confirmed by locus-specific bisulfite-sequencing (Fig. 3b). In addition to the enhancer-like signature, NF-kB and Pol2 ChIP-seq peaks were observed in a lymphoblastoid cell-line (GM12878), but not in a non-lymphoblastoid cell-line (HepG2). Consistently, ERAP1 exhibited the highest expression in T-cells (Fig. 3c). This LTR77 element exhibited modest enhancer activity in 293T, SK-N-SH, and GM12878 cells in reporter assays (Supplementary Fig. 13, LTR77-1). In the brain samples, GFRA1 appeared as a putative target of an LFSINE element (Fig. 3d). We observed tissue-specific H3K4me1 marks and a H3K4me3 mark in the promoter region in fetal brain, but not in T-cells (Fig. 3d). Transcription factor binding motifs, such as that for SOX10, a regulator of neural crest and glial cell development30,31, were identified in the hypomethylated LFSINE element upstream of GFRA1. Consistent with the hypothesis that LFSINE is a tissue-specific enhancer, GFRA1 was highly and specifically expressed in neuronal cells (Fig. 3f). This element exhibited enhancer activity in 293T and SK-N-SH cells but not in GM12878 (Supplementary Fig. 13, LFSINE-1).
Hypomethylation of these TEs did not appear to be a result of increased expression of nearby genes, since the hypomethylation was not observed for other TE families in the same genomic neighborhood (Fig. 3a, d). Additional members of the LTR77, LTR12, UCON29 and LFSINE subfamilies were validated and shown to exhibit tissue-specific hypomethylation and associate with nearby tissue-specific gene expression (Supplementary Fig. 14, 15). Of the 36 TE-derived candidates for which we performed reporter gene assays, 26 showed enhancer activities ranging from a 5- to 1000-fold increase in at least one of the three cell-lines tested (Supplementary Fig. 13). These hypomethylated TE sequences have not been previously annotated as functional elements, but our results suggest that they may influence tissue-specific gene expression.
We next examined the relationship between sequences of TEs, their epigenetic status, and transcription factor binding. We analyzed histone modification and transcription factor binding data for two cell-lines (GM12878 and SK-N-SH) published by ENCODE32,33. We focused on individual copies of two TE families that exhibited tissue-specific hypomethylation in either blood (LTR77) or fetal brain (LFSINE). Consistent with our previous findings, members of these two TE families were enriched for enhancer marks in a cell type-specific manner (Fig. 4): LTR77 exhibited the H3K4me1 mark and p300 binding in GM12878, but not in SK-N-SH; LFSINE exhibited p300 binding in SK-N-SH, but was not enriched for H3K4me1 or p300 signal in GM12878. Binding sites of several transcription factors were enriched in LTR77 and LFSINE and showed cell type specificity (Fig. 4). For example, NF-kB binding overlapped specifically with LTR77 in GM12878; Rad21 bound within LFSINE more than within LTR77; and Rad21 bound within LFSINE more in SK-N-SH than in GM12878 (Fig. 4). Not surprisingly, many TEs were predicted to contain a sequence motif when scanned using position-specific weight matrices of transcription factors (Fig. 4). Having a motif was neither necessary nor sufficient for the actual binding, which correlated strongly with the cell type-specific enhancer mark. Taken together, ENCODE data confirmed that sequences of specific TE families exhibited cell type-specific enhancer signatures and cell type-specific transcription factor binding. Whether there is a causal relationship between the TEs’ epigenetic mark and transcription factor binding awaits further investigation.
For decades, TEs have been deemed as parasitic DNA as a result of the impact of their transposition in the genome34,35. Transposition of TEs may be deleterious when they disrupt coding sequences or normal gene expression, resulting in human diseases36-38. Thus, it is believed that cells have acquired epigenetic mechanisms to cope with TEs so that transposon-derived sequences are completely methylated and transcriptionally silent in somatic tissues14,39.
However, TE transpositions might provide diverse genetic material for natural selection, which would contribute to the evolution of species-specific traits and population biodiversity40,41. Many functional elements were born by “exaptation”, a process in which DNAs of a transposon are co-opted to benefit the host42-44. TE insertions with regulatory functions have been described in mammals4,5,7,45. A substantial proportion of constrained non-coding sequences arose from TEs46,47, pointing to transposons as a driving force in the evolution of regulatory networks. Some hypomethylated TE subfamilies identified here were conserved based on their PhastCons and PhyloP scores, suggesting that this conservation might be a consequence of selection (Supplementary Fig. 16, 17). While we do not know how many TEs could have regulatory functions, previous reports indicate that 5% of TEs are under evolutionary constraint46,47. TE sequences were incorporated in gene networks under the control of transcription factors including TP53 (ref. 6), OCT4 (refs. 4,7) and CTCF (ref. 48), and MER20 was reported to have contributed to the origin of pregnancy in placental mammals (ref. 5). TE-derived sequences can directly regulate expression. For example, ISL1 is regulated by a SINE element49, and so is FGF8 in the forebrain50. In both cases, TEs provide distal enhancers that help control expression of host genes, and their hypomethylation status in brain cells was confirmed by our genome-wide data (Supplementary Fig. 14).
Our findings help to resolve the conflicting observations that TE sequences are globally suppressed by epigenetic mechanisms, including DNA methylation, but that they can mediate gene regulation in some instances. In this study, we challenge the general notion that TEs are constitutively methylated by examining the extent to which TE methylation differs between cell-types and the relationship between epigenetic silencing and TE sequences’ potential to impact gene regulation. Epigenetic control of TEs may contribute to developmental stage-specific, cell type-specific, and perhaps health condition-specific gene regulation. Distal regulatory regions are methylated at low levels, display enhancer chromatin marks, and are occupied by cell type-specific transcription factors27. Our results suggest that some TE sequences match this profile of distal enhancers. With a few exceptions51,52, the majority of human TEs are fixed and no longer active. Sequences within these TEs, however, could be adapted to serve as enhancers, and these sequences might be the reason for their epigenetic regulation. The mechanisms through which DNA within TEs is demethylated and obtains enhancer chromatin marks, and the relationship between TE-derived enhancers and other regulatory elements, remain to be elucidated. A recent report demonstrated that transposons on a human chromosome acquired activating histone modifications and changed DNA methylation status in mouse cells53. In rodents, some endogenous retroviruses function as species-specific enhancers in the placenta54. Therefore, as a source of new regulatory elements, TEs’ regulatory potential could be controlled by tissue- or cell type-specific epigenetic regulation. In our study, examination of DNA methylation in four distinct tissue types showed that while sequences of many TE families are globally hypermethylated, about 10% of TE families are hypomethylated in a tissue-specific manner and gain distal enhancer signatures.
Analysis of a more extensive panel of tissues may reveal that a much larger portion of sequences derived from TEs may harbor gene regulatory function.
Further details for computational analyses are provided in the Supplementary Note.
Buffy coats were obtained from the Stanford Blood Center (Palo Alto, CA). Blood was drawn and processed on the same day. Peripheral blood mononuclear cells (PBMC) were isolated by Histopaque 1077 (Sigma-Aldrich, St. Louis, MO) density gradient centrifugation according to the manufacturer’s protocol. Further purification of CD4 memory, CD4 naïve, and CD8 naïve T lymphocytes was performed using a Robosep instrument and isolation kits for each subpopulation as listed below (STEMCELL Technologies, Vancouver, BC, Canada). Total PBMC were karyotyped (Molecular Diagnostic Services Inc., San Diego, CA) and analyzed for cell cycle. PBMC and T cell subpopulations were stained with antibodies and analyzed by FACS for purity. Cells were aliquoted for DNA and RNA samples, and were washed in PBS. Cell pellets for RNA samples were resuspended in 1 ml TRIzol reagent (Invitrogen, Carlsbad, CA), and frozen at −80°C. Cell pellets for DNA samples were flash frozen in liquid nitrogen and stored at −80°C. Reagents and Antibodies:
Breast tissues were obtained from disease-free pre-menopausal women undergoing reduction mammoplasty in accordance with institutionally approved IRB protocol # 10-01563 (previously CHR # 8759-34462-01). All tissues were obtained as de-identified samples and linked only with a minimal dataset (age, ethnicity and in some cases parity/gravidity). Tissue was dissociated mechanically and enzymatically, as previously described56. Briefly, tissue was minced and dissociated in RPMI 1640 with L-glutamine and 25 mM HEPES (Fisher, cat # MT10041CV) supplemented with 10% fetal bovine serum (JR Scientific, Inc, cat # 43603), 100 units/ml penicillin, 100μg/ml streptomycin sulfate, 0.25μg/ml fungizone, gentamycin (Lonza, Cat # CC4081G), 200U/ml collagenase 2 (Worthington, cat # CLS-2) and 100U/ml hyaluronidase (Sigma-Aldrich, cat # H3506-SG) at 37°C for 16h. The cell suspension was centrifuged at 1,400rpm for 10min followed by a wash with RPMI 1640/10% FBS. Clusters enriched in epithelial cells (referred to as organoids) were recovered after serial filtration through a 150-μm nylon mesh (Fisher, cat # NC9445658), and a 40-μm nylon mesh (Fisher, cat # NC9860187). The final filtrate contained primarily mammary stromal cells (fibroblasts, immune cells and endothelial cells) and some single epithelial cells. Following centrifugation at 1,200rpm for 5min, the epithelial organoids and filtrate were frozen for long-term storage. On the day of cell sorting, epithelial organoids were thawed and further digested with 0.5g/L 0.05% trypsin-EDTA and dispase-DNAse I (STEMCELL Technologies, cats # 7913 and # 7900, respectively). Generation of single cell suspensions was monitored visually. Single cell suspensions were filtered through a 40-μm cell strainer (Fisher, cat # 087711), spun down and allowed to “regenerate” in MEGM medium (Lonza) supplemented with 2% fetal calf serum for 60-90min at 37°C.
This “regeneration” step enables quenching of trypsin and re-expression of the cell surface markers prior to staining as their extra cellular domain had been cleaved by trypsin.
The single cell suspension obtained as described above was stained for cell sorting with three human-specific primary antibodies, anti-CD10 labeled with PE-Cy7 (BD Biosciences, cat # 341092) to isolate myoepithelial cells, anti-CD227/MUC1 labeled with FITC (BD Biosciences cat # 559774) to isolate luminal epithelial cells or anti-CD73 labeled with PE (BD Biosciences, cat # 550257) to isolate a stem cell-enriched cell population, and with biotinylated antibodies for lineage markers, anti-CD2, CD3, CD16, CD64 (BD Biosciences, cat # 555325, 555338, 555405 and 555526), CD31 (Invitrogen, cat # MHCD3115), CD45, CD140b (BioLegend, cat #s 304003 and 323604) to specifically remove hematopoietic, endothelial and leukocyte lineage cells, respectively, by negative selection. Sequential incubation with primary antibodies was performed for 20min at room temperature in PBS with 1% bovine serum albumin (BSA), followed by washing in PBS with 1% BSA. Biotinylated primary antibodies were revealed with an anti-human secondary antibody labeled with streptavidin-Pacific Blue conjugate (Invitrogen, cat # S11222). After incubation, cells were washed once in PBS with 1% BSA and cell sorting was performed using a FACSAria II cell sorter (BD Biosciences).
Post-mortem human fetal neural tissues were obtained from a case of twin non-syndrome fetuses whose death was attributed to environmental/placental etiology. Tissues were obtained with appropriate patient consent according to Partner’s Healthcare/Brigham and Women’s Hospital IRB guidelines (Protocol #2010P001144). All samples and tissues were de-identified and linked only with minimal dataset (age, gender, brain location). Fetal brain tissue and fetal neural progenitor cells were derived from manually dissected regions of the brain (telencephalon), specifically the neocortex (pallium; GSM666914, GSM669615, GSM669610, GSM669612) and ganglionic eminences (subpallium; GSM669611, GSM669613). The tissues were minced and dissociated by combination of mechanical agitation (gentleMACS device) during enzymatic treatment with papain according to manufacturer’s protocol (Miltenyi Biotec, Neural tissue dissociation kit #130-092-628). Cell suspensions were then washed twice in DMEM and plated at low density in human NeuroCult NS-A media (Stem cell technology # 05751) supplemented with heparin, EGF (20ng/ml) and FGF (10ng/ml) in ultra low attachment cell culture flasks (Corning #3814).
Data were obtained from a previous publication15.
All assays were performed as part of the NIH Roadmap Epigenomics Mapping Centers’ repository for the human reference epigenome atlas57. Experiments were performed under the guidelines of the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/protocols). Specifically, MeDIP-seq and MRE-seq were performed as previously described16. ChIP-seq was performed as described58. All data have been submitted to NCBI (Supplementary Table 3).
Total genomic DNA underwent bisulfite conversion following an established protocol59 with the following modification: 95 °C for 1 min and 50 °C for 59 min, for a total of 16 cycles. Regions of interest were amplified with PCR primers (see below) and were subsequently cloned using pCR2.1/TOPO (Invitrogen). Individual bacterial colonies were subjected to PCR using vector-specific primers and sequenced using an ABI 3700 automated DNA sequencer. The data were analyzed with the online software BISMA60. Results are summarized in Supplementary Fig. 13. Genomic locations of candidates and primer information are summarized in Supplementary Table 4.
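The logic behind clone-based bisulfite analysis can be sketched as follows: after conversion and PCR, an unmethylated cytosine reads as T while a methylated cytosine remains C, so the methylation level at each CpG is the fraction of clones retaining C. The sketch below is a simplified illustration and not the BISMA procedure; it assumes that clone reads are already aligned to the reference without gaps, and the function names and sequences are hypothetical.

```python
def cpg_positions(reference):
    """Return 0-based indices of the C in every CpG of the reference."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def cpg_methylation(reference, clones):
    """Fraction of clones retaining C at each CpG (None if no clone is informative)."""
    if any(len(c) != len(reference) for c in clones):
        raise ValueError("clones must be pre-aligned to the reference length")
    levels = {}
    for pos in cpg_positions(reference):
        bases = [c[pos].upper() for c in clones]
        methylated = bases.count("C")    # protected from conversion
        unmethylated = bases.count("T")  # converted C -> U, read as T
        called = methylated + unmethylated
        levels[pos] = methylated / called if called else None
    return levels


# Hypothetical reference and three bisulfite-converted clone reads.
reference = "ACGTCGA"
clones = ["ACGTCGA", "ATGTTGA", "ACGTTGA"]
levels = cpg_methylation(reference, clones)
```

Bases other than C or T at a CpG position are treated as uninformative and excluded from the denominator, which avoids counting sequencing errors as unmethylated calls.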
TE candidates were amplified from genomic DNA using Pfu polymerase (Agilent) and primers containing KpnI or BglII restriction sites. PCR products were gel-purified using the Qiagen Gel purification kit, and then digested with the corresponding restriction enzymes (NEB). The digested PCR products were cloned into the pGL4.23[luc2/minP] vector (Promega, E8411) using T4 ligase (NEB) and transformed into chemically competent DH5α cells. Positive clones were verified by enzyme digestion and sequencing. 800 ng of reporter plasmid (or empty pGL4.23[luc2/minP] vector control) were transfected in triplicate, using X-tremeGENE (Roche), into three cell lines: 293T, GM12878, and SK-N-SH_RA (SK-N-SH cells differentiated with 6 μM retinoic acid for 48 hours). To normalize for transfection efficiency, 200 ng of renilla luciferase plasmid driven by a TK promoter were co-transfected. Luciferase activity was measured after 48 hours and normalized to the corresponding renilla control. Genomic locations of candidates and primer information are summarized in Supplementary Table 5.
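The normalization described above can be illustrated with a short sketch: each replicate's firefly luciferase reading is divided by its co-transfected renilla reading, replicates are averaged, and the candidate is expressed as a fold change over the empty-vector control. The function names and readings below are hypothetical, and the exact averaging scheme used in the study is an assumption.

```python
from statistics import mean


def relative_activity(firefly, renilla):
    """Per-replicate firefly signal normalized to the renilla control."""
    if len(firefly) != len(renilla) or not firefly:
        raise ValueError("need matched, non-empty firefly and renilla readings")
    if any(r <= 0 for r in renilla):
        raise ValueError("renilla readings must be positive")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_over_control(candidate, control):
    """Mean normalized activity of a candidate relative to the empty vector."""
    cand = mean(relative_activity(*candidate))
    ctrl = mean(relative_activity(*control))
    if ctrl == 0:
        raise ValueError("control activity is zero; fold change is undefined")
    return cand / ctrl


# Hypothetical triplicate readings as (firefly, renilla) pairs of lists.
candidate = ([1000.0, 2000.0, 3000.0], [100.0, 100.0, 100.0])
empty_vector = ([200.0, 200.0, 200.0], [100.0, 100.0, 100.0])
fold = fold_over_control(candidate, empty_vector)
```

Dividing by renilla before averaging corrects each well for its own transfection efficiency, so a single poorly transfected replicate does not distort the fold change.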
We thank the many collaborators in Reference Epigenome Mapping Centers (REMCs), Epigenome Data Analysis and Coordination Center and NCBI who have generated and processed data which were used in this project. We acknowledge the dedicated system administrators at Washington University Center for Genome Sciences and Systems Biology who have provided an excellent computing environment. We thank UCSC Genome Browser bioinformatics team for providing processed ENCODE data. We acknowledge support from NIH Roadmap Epigenomics Program, sponsored by the National Institute on Drug Abuse (NIDA) and the National Institute of Environmental Health Sciences (NIEHS). J.F.C., T.W., P.F. and M.H. are supported by NIH grant 5U01ES017154. B.Z and X.Z. are supported by NIDA’s R25 program DA027995. K.L.L. and C.M. are supported by NIH grant P01CA095616 and P01CA142536. T.W. is supported in part by the March of Dimes Foundation, the Edward Jr. Mallinckrodt Foundation, P50CA134254 and a generous start up package from Department of Genetics, Washington University School of Medicine.
Author contributions J.F.C and T.W. designed the study. C.L.M, K.L.L., P.G., M.S., T.D.T., T.K, and A.W. collected samples. C.H., H.O., P.J.F., A.J.M., A.T., B.K., S.C., R.M., M.H., and M.A.M. performed sequencing assays. M.X., B.Z., R.L., D.L., X.Z., H.J.L., P.A.F.M, and T.W. performed data analysis. C.H., X.X., and M.X. performed bisulfite validation and reporter gene assays. M.X., J.F.C. and T.W. wrote the manuscript. All authors discussed the results and contributed to writing the manuscript.
Competing financial interests The authors declare no competing financial interests.
Accession codes Complete datasets used in this study: