Recent advances in high-throughput sequencing technologies have revealed that >90% of the human genome is transcribed, of which only 1–2% accounts directly for protein synthesis (45). It is increasingly evident, in humans and other organisms, that the transcriptome is significantly more complex than previously supposed, with RNA having a much broader influence over the manifested phenotype than implied solely by its role as messenger. Epigenetic mechanisms such as cytosine methylation and histone modifications are known to influence gene expression. Aberrations of the epigenome have been found to be associated with several human diseases and disorders, and there have been increasing reports associating aberrant lncRNA expression with cancer, cardiovascular disorders and other maladies (46,47). However, the association of epigenomic features such as cytosine methylation and histone modifications with lncRNA genes has not been studied at the genome-wide level.
In the present report we have attempted to draw a global picture of epigenetic marks across lncRNA loci in humans. The epigenetic marks studied here include histone modifications and DNA methylation, both of which have recently been studied extensively in relation to the regulation of protein-coding genes. We performed a comprehensive analysis of DNA methylation, H3K27me3 and H3K9me3 as representative repressive marks, which are known to be associated with chromatin repression, and of H3K4me3 and H3K36me3 as representative expression-associated marks. The complete raw datasets covering the transcription-repressive and transcription-activating marks were obtained from the NCBI repository. Datasets that are still under embargo could not be included in the analysis (Supplementary Table S4). In addition, we have not included datasets from in vitro differentiated, stem-cell-derived and transformed cell types, since they are likely to have altered epigenetic profiles (48). Of the remaining cell types, we chose H1 as a representative pluripotent embryonic stem cell line, primary CD34+ cells as representative multipotent haematopoietic cells, and IMR90 (foetal lung fibroblasts) and PBMCs as representative differentiated cell types. In addition, we chose two tissue types, brain and liver, which represent two organs with distinct physiological roles and germ-layer origins (brain being ectodermic and liver mesoendodermic). Similarities in epigenetic signatures between these two tissues should reflect the global schema for the distribution of epigenetic marks. Thus, including such disparate cases of cell fate and identity allowed us to derive conclusions regarding the distribution of epigenetic marks in general, regardless of cellular differentiation status.
DNA methylation is an important, evolutionarily conserved epigenetic mark (49). It is known that the TSS of expressed protein-coding genes is hypomethylated, in agreement with earlier observations that the methylation density of highly expressed protein-coding genes is lowest at the TSS and remains low even downstream of it (39). In contrast to this pattern, our results indicate that in lowly expressed protein-coding genes the methylation density shows an upward trend from the TSS and is highest immediately downstream of it, in the region of the first exon. This is consistent with earlier studies showing that DNA methylation immediately downstream of the TSS, i.e. in the first exon, is much more tightly linked to gene silencing than promoter methylation (50). In lncRNA genes, however, the methylation density is high downstream of the TSS irrespective of expression level. Thus, unlike in protein-coding genes, methylation downstream of the TSS (in the first exon) is not a distinguishing feature of lncRNA silencing, suggesting that other factors might also be associated with lncRNA gene regulation.
Another evolutionarily conserved feature of the TSSs of protein-coding genes is their association with CGIs (43). About half of all CGIs contain TSSs of annotated protein-coding genes (43). The others are classified as ‘orphan’ CGIs, whose purpose is poorly understood (51). Several genome-wide Pol II mapping studies have revealed that a majority of these sites are also transcription initiation sites. Some lncRNAs, such as Air and Kcnq1ot1, have also been shown to be initiated from such ‘orphan’ CGIs present in introns of the Igf2r and Kcnq1 genes, respectively (52–54). In our analysis we found that CGIs overlap the TSS in ~24% of lncRNA genes, and by extension we infer that such orphan CGIs might be the transcription initiation sites of other ncRNAs as well. CGI distribution within the genome often coincides with the H3K4me3 mark (55,56). It is a well-accepted paradigm that DNA methylation corresponds to repressive chromatin, while H3K4me3 is associated with transcriptionally active chromatin (57,58). Our analysis shows that H3K4me3 marks occur more frequently in both mRNA and lncRNA genes when a CGI is present and less frequently when it is absent. This suggests that the CGIs of lncRNA genes are also marked by H3K4me3. However, when we examined the association of the repressive histone marks H3K27me3 and H3K9me3 with CGIs present at lncRNA and protein-coding genes, we did not find any relationship, with the exception of H3K27me3 in brain germinal matrix tissue. We also found that ~40% of protein-coding gene TSSs and ~12% of lncRNA gene TSSs in brain germinal matrix tissue carried both H3K4me3 and H3K27me3 marks.
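The CGI overlap percentages reported above reduce to a point-in-interval count. The following is a minimal sketch of that kind of calculation, not the pipeline used in this study; the function name, the 0-based half-open coordinate convention, the assumption that CGIs on a chromosome do not overlap one another, and the toy coordinates are all illustrative assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def fraction_tss_in_cgi(tss_list, cgi_list):
    """Return the fraction of TSSs lying inside at least one CGI.

    tss_list: iterable of (chrom, position) tuples.
    cgi_list: iterable of (chrom, start, end) tuples, 0-based half-open.
    Assumes CGIs on the same chromosome do not overlap each other.
    """
    # Group CGI intervals by chromosome and sort them by start coordinate.
    by_chrom = defaultdict(list)
    for chrom, start, end in cgi_list:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    tss_list = list(tss_list)
    if not tss_list:
        return 0.0

    hits = 0
    for chrom, pos in tss_list:
        if chrom not in by_chrom:
            continue
        # Index of the last CGI whose start is <= pos (or -1 if none).
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < by_chrom[chrom][i][1]:
            hits += 1
    return hits / len(tss_list)


# Toy (hypothetical) coordinates, for illustration only.
toy_cgis = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
toy_tss = [("chr1", 150), ("chr1", 300), ("chr2", 79), ("chr2", 80)]
print(fraction_tss_in_cgi(toy_tss, toy_cgis))
```

Running the same function separately on lncRNA and protein-coding TSS lists would give the two per-class percentages; in practice a window around each TSS, rather than a single base, may be preferred.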
We also analysed histone modifications associated with active (H3K4me3 and H3K36me3) and repressed (H3K9me3 and H3K27me3) chromatin. Across cell and tissue types, H3K4me3 showed a similar distribution pattern for both protein-coding and lncRNA genes, with increased density at the TSS. Furthermore, the presence of H3K4me3 at the TSS and of H3K36me3 in the gene body corresponded to higher expression of both protein-coding and lncRNA genes. This suggests that, unlike the repressive methylation marks, these transcription-activating marks could better explain the regulation of lncRNA expression.
H3K27me3 seems to play similar roles in the expression of lncRNA and mRNA genes: highly expressed transcripts of both classes seem to lack this mark at their TSS, whereas lowly expressed transcripts show higher occupancy of this repressive mark. This is consistent with a previous report suggesting that lncRNAs expressed at lower levels have higher H3K27me3 at their promoters. However, unlike H3K27me3, the repressive mark H3K9me3 does not seem to dictate repression in the lncRNA class: highly expressed lncRNAs also carried it at their TSS, in contrast to protein-coding genes, whose expression was inversely correlated with the presence of this mark.
Furthermore, H3K4me3 and H3K27me3 are known to co-occupy certain genomic regions known as bivalent domains, which are associated with the promoters of lineage regulatory genes. We observed that the occurrence of these bivalent marks (H3K4me3 and H3K27me3) was highest in brain germinal matrix tissue: 41% of mRNA genes and 12% of lncRNA genes. Brain germinal matrix tissue is a proliferative centre that is a source of neurons and glial cells. In all other datasets analysed, the occupancy was between 1% and 10% for mRNA genes and between 0.4% and 3.2% for lncRNA genes. In H1 embryonic stem cells, 8.8% of mRNA genes and 3.2% of lncRNA genes were occupied by bivalent marks. It is well known that lineage-related genes have bivalent marks in pluripotent stem cells. Such bivalent marks are generally believed to silence developmental lineage-specific genes (via H3K27me3) while poising them for subsequent activation (via H3K4me3) during differentiation. However, a recent study by Gobbi et al. suggests that this view of the relation between bivalent marks in genes and their subsequent expression during differentiation may be oversimplistic (59). They found that genes that have bivalent marks in pluripotent and multipotent cells may be expressed at low levels during lineage priming. Further studies are necessary to understand the implications of these bivalent marks in the regulation of lineage-specific genes.
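The bivalent occupancy percentages quoted above are, in essence, the share of genes in a class whose TSS carries both marks. As a hedged illustration only, assuming per-mark gene sets have already been derived from peak calls (the function name and the gene identifiers below are hypothetical and not taken from this study), the calculation could look like this:

```python
def bivalent_fraction(genes, h3k4me3_genes, h3k27me3_genes):
    """Return the fraction of `genes` marked by both H3K4me3 and H3K27me3.

    genes: identifiers of all genes in the class (e.g. all lncRNA genes).
    h3k4me3_genes, h3k27me3_genes: identifiers of genes whose TSS overlaps
    a peak of the respective mark. Identifiers outside `genes` are ignored.
    """
    genes = set(genes)
    if not genes:
        return 0.0
    # A gene counts as bivalent only if it appears in both mark sets.
    bivalent = genes & set(h3k4me3_genes) & set(h3k27me3_genes)
    return len(bivalent) / len(genes)


# Toy (hypothetical) identifiers, for illustration only.
all_genes = ["g1", "g2", "g3", "g4", "g5"]
k4 = {"g1", "g2", "g3"}
k27 = {"g2", "g3", "g4", "gX"}
print(bivalent_fraction(all_genes, k4, k27))
```

Computing this once per cell or tissue type, and separately for mRNA and lncRNA genes, would yield a table of the same shape as the percentages discussed in this paragraph.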
Epigenetic marks such as DNA methylation and histone modifications regulate the expression of the genetic message and therefore determine cellular, and hence organismal, identity. LncRNAs are also involved in the manifestation of cellular identity; however, the epigenetic marks governing their expression are not well characterized. We have found that a large proportion of lncRNA genes lack any of the aforesaid epigenetic marks. Where present, however, these marks show a distribution pattern akin to that of protein-coding genes, with the exception of DNA methylation. Moreover, the distribution pattern of epigenetic features does not differ significantly among the stem cells, differentiated cells and tissues used, which indicates that the general behaviour of these processes remains unchanged regardless of differentiation and proliferative status.
Thus, our observations show that the DNA methylation pattern in the immediate vicinity of the TSS is remarkably dissimilar for lncRNA and protein-coding genes. Furthermore, the histone marks H3K4me3, H3K36me3 and H3K27me3 correlate with the expression of lncRNAs in a manner similar to that of mRNAs. However, the repressive marks DNA methylation and H3K9me3 do not seem to be involved in regulating the expression of lncRNAs.