The development of a human body from a single fertilized egg is a complex process that is regulated in both space and time. Genes responsible for general cellular function are expressed in all cell types and tissues. However, in many tissues and cell types, specialized functions require or exclude the expression of certain genes. The mechanism of this tissue/cell-type specific regulation (TCSR) is intriguing, because such diverse expression patterns are achieved with a single genome that is largely shared by all cells. Gene transcription is regulated in multiple layers, e.g. transcription factor binding through DNA nucleotide features, DNA methylation, and chromatin modifications. TCSR may involve combinations of regulatory mechanisms from all of these layers (for review, see [1]).
Thanks to next-generation sequencing technology, our understanding of human TCSR has advanced rapidly in recent years. At the base layer of DNA features, the association between DNA regulatory elements, such as the TATA box and CpG islands in promoter regions, and tissue-specific regulation has been investigated experimentally [1] and computationally [4]. Tissue-specific regulatory transcription factor binding sites in promoter regions have been well studied in muscle [5] and liver [6], and binding sites have also been detected in multiple tissues using generic transcription factor binding site prediction tools [7]. Cell-type specific enhancers have been experimentally explored in several cell types as well [10]. High-throughput Cap Analysis of Gene Expression (CAGE) data showed that alternative transcription start sites (TSSs) are more prevalent in the mammalian genome than previously thought [11], and the distributions of TSSs have also been associated with TCSR [12]. Recently, genome-wide mapping of histone modifications and variants (HMVs) in CD4+ T cells [13], as well as in other cell types [15], opened up an opportunity to model gene expression levels from the perspective of post-translational modification of histones [16]. For example, Pekowska et al. clustered genes by their H3K4me2 profiles at promoter regions in CD4+ T cells and found that one cluster was enriched in CD4+ T cell specific genes [17]. However, a comprehensive picture of how post-translational modifications of histones contribute to TCSR is still lacking.
Therefore, in this work, we addressed three major questions: 1) which HMVs carry sufficient information to allow TCSR target gene prediction, 2) whether TCSR is the same as gene expression activity regulation, and 3) whether the predictive relationship between HMVs and TCSR target genes is universal for the entire Pol II transcriptome. To address these questions, we developed a quantitative model that links HMVs to TCSR target genes using CoreBoost, and applied it to recently published genome-wide maps of HMVs in CD4+ T cells [13]. CoreBoost is a previously developed boosting classifier [18] that can select informative features from an ensemble of weak classifiers. First, we show that HMV profiles in both proximal promoters and gene bodies are predictive of CD4+ T cell specificity. We identified the most predictive HMV types for CpG- and nonCpG-related genes in promoters and gene bodies. The evidence shows that the underlying enhancers and intragenic alternative promoters marked by these HMV patterns are associated with tissue/cell-type specific gene expression. Second, we demonstrate that TCSR is different from the regulation of gene expression activity. Finally, the model, which was trained on HMV data of protein-coding genes in CD4+ T cells, successfully predicted muscle cell specific genes and CD4+ T cell specific microRNA genes.
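As a rough illustration of how a boosting classifier can pick out informative features from an ensemble of weak classifiers, the sketch below implements generic discrete AdaBoost with single-feature decision stumps. It is not CoreBoost itself, and it does not use the data analyzed in this work: the feature names (`mark_A`, `mark_B`, `mark_C`), the synthetic data, and all function names are assumptions made purely for illustration.

```python
import math
import random

# Hypothetical feature names standing in for HMV signals; not from the paper.
FEATURES = ["mark_A", "mark_B", "mark_C"]


def make_toy_data(n=200, seed=0):
    """Synthetic data: only feature 0 carries signal; every 10th label is flipped."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in FEATURES] for _ in range(n)]
    y = [1 if row[0] > 0.5 else -1 for row in X]
    for i in range(0, n, 10):  # deterministic 10% label noise
        y[i] = -y[i]
    return X, y


def stump_predict(row, j, threshold, polarity):
    """Weak classifier: threshold a single feature, output +1 or -1."""
    return polarity if row[j] > threshold else -polarity


def best_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # midpoint between neighbouring values
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best


def adaboost(X, y, rounds=5):
    """Discrete AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, threshold, polarity))
        # Up-weight misclassified samples, down-weight correctly classified ones.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, j, t, p) for a, j, t, p in ensemble)
    return 1 if score >= 0 else -1


def feature_weights(ensemble, n_features):
    """Sum of stump weights per feature: a crude measure of informativeness."""
    totals = [0.0] * n_features
    for alpha, j, _, _ in ensemble:
        totals[j] += alpha
    return totals


if __name__ == "__main__":
    X, y = make_toy_data()
    ensemble = adaboost(X, y)
    for name, weight in zip(FEATURES, feature_weights(ensemble, len(FEATURES))):
        print(f"{name}: {weight:.3f}")
```

In this toy setting, the first stump selected is built on the one feature that actually determines the label, which is the sense in which a boosted ensemble can double as a feature selector. The per-feature sum of stump weights is only one simple way to summarize that selection.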