Chromatin state annotation using combinations of chromatin modification patterns has emerged as a powerful approach for discovering regulatory regions and their cell-type-specific activity patterns, and for interpreting disease-association studies1-5. However, learning chromatin state models from large numbers of chromatin modification datasets in multiple cell types still requires extensive bioinformatics expertise, which makes the approach inaccessible to the wider scientific community. To address this challenge, we have developed ChromHMM, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets, and visualizing the resulting genome-wide maps of chromatin state annotations.
ChromHMM is based on a multivariate Hidden Markov Model that models the observed combination of chromatin marks using a product of independent Bernoulli random variables2, which enables robust learning of complex patterns of many chromatin modifications. As input, it receives a list of aligned reads for each chromatin mark, which are automatically converted into presence or absence calls for each mark across the genome, based on a Poisson background distribution. An optional additional input of aligned reads for a control dataset can be used either to adjust the presence or absence threshold or as an independent input feature (Supplementary Note). Alternatively, the user can input files that contain calls from an independent peak caller. By default, chromatin states are analyzed at 200-base-pair intervals that roughly approximate nucleosome sizes, but smaller or larger windows can be specified. We have also developed a new parameter initialization procedure that enables relatively efficient inference of comparable models across different numbers of states (Supplementary Note).
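As an illustration of the two steps described above, the following Python sketch binarizes per-bin read counts against a Poisson background and evaluates the emission probability of one state as a product of independent Bernoulli terms. It is a minimal sketch and not ChromHMM's implementation: the function names, the use of the mean count per bin as the Poisson rate, and the tail-probability cutoff of 1e-4 are assumptions chosen for illustration.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # pmf at 0
    cdf = 0.0
    for i in range(k):
        cdf += term              # add pmf(i)
        term *= lam / (i + 1)    # advance to pmf(i + 1)
    return max(0.0, 1.0 - cdf)


def count_threshold(lam, p_cutoff=1e-4):
    """Smallest read count t such that P(X >= t) < p_cutoff.

    p_cutoff is an illustrative value, not a documented default.
    """
    t = 1
    while poisson_tail(t, lam) >= p_cutoff:
        t += 1
    return t


def binarize(counts, p_cutoff=1e-4):
    """Convert per-bin read counts for one mark into 0/1 presence calls.

    Assumption: the background rate is the mean count per bin.
    """
    lam = sum(counts) / len(counts)
    t = count_threshold(lam, p_cutoff)
    return [1 if c >= t else 0 for c in counts]


def emission_prob(observed, p_present):
    """Probability of an observed 0/1 mark vector in one state.

    p_present[m] is the state's probability that mark m is present;
    marks are treated as independent Bernoulli variables.
    """
    prob = 1.0
    for x, p in zip(observed, p_present):
        prob *= p if x == 1 else 1.0 - p
    return prob
```

For example, `binarize([0, 1, 2, 1, 0, 1, 30, 2, 1, 2])` marks only the bin with 30 reads as present, and `emission_prob([1, 0], [0.9, 0.2])` multiplies 0.9 by (1 - 0.2).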
ChromHMM then outputs both the learned chromatin state model parameters and the state assignments for each genomic position. The learned emission and transition parameters are returned in both text and image format (Fig. 1), automatically grouping states with similar emission parameters or proximal genomic locations, although a user-specified reordering can also be used (Supplementary Fig. 1-2, Supplementary Note). ChromHMM enables the study of the likely biological roles of each chromatin state based on enrichment in diverse external annotations and experimental data, shown as heat maps and tables (Fig. 1), both for direct genomic overlap and at various distances from a state (Supplementary Fig. 3). ChromHMM also generates custom UCSC genome browser tracks6 showing the resulting chromatin state segmentation in dense view (single color-coded track), or expanded view (each state shown separately) (Fig. 1). All the files ChromHMM produces by default are summarized on a webpage that it also creates (Supplementary Data).
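The outputs described above can be made concrete with a short Python sketch. It collapses per-bin state assignments into contiguous segments, as a dense browser track would display them, and computes a simple fold enrichment of each state for an external annotation. This is a sketch under stated assumptions and not ChromHMM's code: the function names are hypothetical, the annotation is given as one 0/1 flag per bin, and enrichment is defined here as the observed overlap fraction divided by the fraction expected if state and annotation were independent.

```python
def to_segments(chrom, states, bin_size=200):
    """Merge consecutive bins with the same state into (chrom, start, end, state)."""
    segments = []
    start = 0
    for i in range(1, len(states) + 1):
        # Close the current run at the end of the list or when the state changes.
        if i == len(states) or states[i] != states[start]:
            segments.append((chrom, start * bin_size, i * bin_size, states[start]))
            start = i
    return segments


def fold_enrichment(states, in_annotation, n_states):
    """Fold enrichment of each state for an annotation (both given per bin).

    Assumed definition: (fraction of the state's bins that overlap the
    annotation) / (fraction of all bins that overlap the annotation).
    """
    total = len(states)
    annotated = sum(in_annotation)
    state_bins = [0] * n_states
    overlap = [0] * n_states
    for s, a in zip(states, in_annotation):
        state_bins[s] += 1
        overlap[s] += a
    result = []
    for s in range(n_states):
        if state_bins[s] == 0 or annotated == 0:
            result.append(float("nan"))  # enrichment is undefined
        else:
            observed = overlap[s] / state_bins[s]
            expected = annotated / total
            result.append(observed / expected)
    return result
```

Values above 1 indicate that a state overlaps the annotation more often than expected by chance under this definition; a heat map of such values across states and annotations corresponds to the kind of summary described in the text.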
ChromHMM also enables the analysis of chromatin states across multiple cell types. When the chromatin marks are common across the cell types, a common model can be learned by a virtual ‘concatenation’ of the chromosomes of all cell types. Alternatively, a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different numbers of states based on correlations in their emission parameters (Supplementary Fig. 4).
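The difference between the two multi-cell-type layouts, and the idea of comparing models through their emission parameters, can be sketched in Python as follows. This is an illustrative sketch, not ChromHMM's implementation: the data layout (a dictionary from cell type to a list of per-bin mark calls), the function names, and the choice of reporting the best Pearson correlation for each state are assumptions.

```python
import math


def concatenate(cells):
    """'Concatenation': same marks, bins of all cell types placed end to end."""
    rows = []
    for name in sorted(cells):
        rows.extend(cells[name])
    return rows


def stack(cells):
    """'Stacking': same bins, marks of all cell types placed side by side."""
    names = sorted(cells)
    n_bins = len(cells[names[0]])
    return [[call for name in names for call in cells[name][i]]
            for i in range(n_bins)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def best_matches(model_a, model_b):
    """For each state of model_a, the most correlated state of model_b.

    Each model is a list of emission vectors (one per state, one value per mark).
    Returns a list of (index_in_model_b, correlation) pairs.
    """
    matches = []
    for emissions_a in model_a:
        corrs = [pearson(emissions_a, emissions_b) for emissions_b in model_b]
        j = max(range(len(corrs)), key=corrs.__getitem__)
        matches.append((j, corrs[j]))
    return matches
```

With two cell types, two marks and two bins each, `concatenate` yields four rows of two columns, whereas `stack` yields two rows of four columns. A concatenated model therefore shares one set of states across cell types, while a stacked model defines states over the joint pattern of all cell types at each position.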