While the primary DNA sequence of the human genome is ultimately responsible for the encoding and functioning of each cell, numerous epigenetic modifications can modulate the interpretation of this primary sequence. These lead to the diversity of function found across different human cell types, play key roles in the establishment and maintenance of cellular identity during development, and have been associated with roles in DNA repair, replication, and disease. Post-translational modifications in the tails of histone proteins that package DNA into chromatin constitute perhaps the most versatile type of such epigenetic information, with more than a dozen positions of multiple histone proteins and variants each undergoing several distinct modifications, such as acetylation and mono-, di-, or tri-methylation [1, 2].
More than 100 distinct histone modifications have been described, leading to the ‘histone code hypothesis’ that specific combinations of chromatin modifications would encode distinct biological functions [3]. Others, however, have instead proposed that individual epigenetic marks act in additive ways and that the multitude of modifications simply provides stability and robustness [4]. Understanding which combinations of epigenetic modifications are biologically meaningful, and revealing their specific functional roles, remain open questions in epigenomics, with great relevance to many ongoing efforts to understand the epigenomic landscape of health and disease.
To directly address these questions, we introduce a novel approach for discovering ‘chromatin states’ (Supplementary Table 1, Supplementary Fig. 1), or biologically meaningful and spatially coherent combinations of chromatin marks, in a systematic de novo way across a complete genome, based on a multivariate Hidden Markov Model (HMM) that explicitly models mark combinations. Biologically, these states may correspond to different genomic elements (e.g. transcription start sites, enhancers, active genes, repressed genes, exons, heterochromatin), even though no information about these genomic elements is given to the model as input.
HMMs are well suited to the task of discovering unobserved ‘hidden’ states from multiple ‘observed’ inputs in their spatial genomic context (see Online Methods). In our model, each state has a vector of ‘emission’ probabilities (Supplementary Figs. 2 and 3), reflecting the different frequencies with which chromatin marks are observed in that state, and an associated ‘transition’ probability vector (Supplementary Fig. 4) encoding spatial relationships between neighboring positions in the genome. These relationships are associated with the spreading of chromatin marks, or with functional transitions such as those between intergenic regions, promoters, and transcribed regions (see Supplementary Notes, Supplementary Figs. 5 and 6).
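To make the roles of the emission and transition probabilities concrete, the following is a minimal sketch of a two-state model over three marks. It is illustrative only and is not the implementation used in this work: all probability values are invented toy numbers, the marks are assumed to be binary (present or absent) and independent given the state, and Viterbi decoding is used here simply as one standard way to assign a state to each position. The names `emission_logprob` and `viterbi` are hypothetical.

```python
import math

def emission_logprob(obs, p):
    """Log-probability of a binary mark vector, assuming each mark is an
    independent Bernoulli variable with presence probability p[k]."""
    return sum(math.log(pk if o else 1.0 - pk) for o, pk in zip(obs, p))

def viterbi(observations, init, trans, emit):
    """Most likely hidden-state path for a sequence of binary mark vectors.
    All probabilities are assumed to be strictly between 0 and 1."""
    n_states = len(init)
    # scores[s] = best log-probability of any path ending in state s
    scores = [math.log(init[s]) + emission_logprob(observations[0], emit[s])
              for s in range(n_states)]
    backptrs = []
    for obs in observations[1:]:
        new_scores, ptrs = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at this position
            best_prev = max(range(n_states),
                            key=lambda r: scores[r] + math.log(trans[r][s]))
            ptrs.append(best_prev)
            new_scores.append(scores[best_prev]
                              + math.log(trans[best_prev][s])
                              + emission_logprob(obs, emit[s]))
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Toy parameters (invented for illustration, not learned from data).
# emit[s][k] = probability that mark k is present in state s.
emit = [[0.90, 0.80, 0.05],   # state 0: marks 0 and 1 frequent
        [0.05, 0.10, 0.70]]   # state 1: mark 2 frequent
# trans[r][s] = probability of moving from state r to state s
# between neighboring positions; high self-transition favors coherent runs.
trans = [[0.9, 0.1],
         [0.1, 0.9]]
init = [0.5, 0.5]

# One binary mark vector per consecutive genomic position.
observations = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 1)]
path = viterbi(observations, init, trans, emit)
print(path)  # [0, 0, 0, 1, 1]
```

In this toy example the third position lacks mark 1 but is still assigned to state 0, because both its emission probability and the high self-transition probability favor continuing the neighboring run. This is the sense in which the transition vector captures spatial context.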
We applied our model to the largest set of chromatin marks available to date, consisting of genome-wide occupancy data for a set of 38 different histone methylation and acetylation marks in human CD4 T-cells, as well as histone variant H2AZ, PolII, and CTCF [5, 6], obtained using chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) (Online Methods). To understand the biological importance of the resulting chromatin states, we undertook a large-scale systematic data-mining effort, bringing to bear dozens of genome-wide datasets including gene annotations, expression information, evolutionary conservation, regulatory motif instances, compositional biases, genome-wide association data, transcription-factor binding, DNaseI hypersensitivity, and nuclear lamina datasets.
This work has strong implications for genome annotation, providing an unbiased and systematic chromatin-driven annotation for every region of the genome at 200 bp resolution, which both refines previously known classes of epigenetic states and introduces new ones. Regardless of whether these chromatin states are causal in directing regulatory processes or simply reinforce independent regulatory decisions, these annotations should provide a valuable resource for interpreting biological and medical datasets, such as genome-wide association studies for diverse phenotypes, and for potentially pinpointing novel classes of functional elements.