Input data for modeling

Our raw datasets were WIG format files

^{[16]}, which are designed for display of dense continuous data. Each file contains signal information for H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K27ac, H3K36me3, H3K20me1, and CTCF in CD4

^{+} T cells. Because nucleosome core particles consist of 147 base pairs (bp) of DNA and the DNA sequences that separate each pair of nucleosomes are approximately 50 bp, we first divided the whole genome into 200-bp non-overlapping intervals. Then we analyzed whether the aforementioned marks were present using Ernst's signal files and thresholds

^{[16]}. We identified 24 sequences

with length

*T*_{c}, corresponding to 24 chromosomes. The position of each sequence indicated a combination of marks in the corresponding interval, i.e.

, where

is binary; if the signal of chromatin mark at position of chromosome is higher than predefined threshold,

, otherwise

.

*M* is the number of chromatin marks.

To identify regulatory elements that correspond to chromatin marks, we extracted known genes (

*N* = 39,991 genes) in each interval using annotations from Refseq and then extended 5 kb on both sides of selected genes to cover possible regulatory elements. After extracting chromatin marks of these extended gene regions from

,

*s* = 1, …, 24, there were

*N* = 39,991 subsequences of combinations of chromatin marks,

,

*c* = 1, …,

*N*, which comprised the primary dataset used in this study. In this equation,

denotes the combination of marks at position

*t* of gene

*c*, extracting for

,

*s* = 1, ..., 24. For fast and stable model learning, we randomly screened out 0.5 percent of the extended gene regions with lengths between 11 kb and 110 kb. These genes were selected from different chromosomes and long enough (6.023 Mb) to determine different chromatin states.

The probabilistic model

Our first model was a multivariate HMM that is a sequence version of naive Bayes. This type of generative model is based on the joint distribution of input observations and output hidden states that we wish to predict. Let chromatin states be hidden states of HMM. Each state depends only on its immediate predecessor described by a transition probabilities *b*_{ij} from state *i* to state *j*. At each time point of this hidden sequence, a chromatin state emits a corresponding combination of marks from a distribution described by a product of independent Bernoulli distribution ().

Let

*K* be the number of hidden states,

be the unobserved state sequence corresponding to sequence

*g*_{c}.

*p*_{k,m} (

*k*=1, …,

*K*,

*m*=1,…

*M*) denote the probability that the mark occurs when the current state is

*K*, and

be the specific combination of marks at point

*t* of hidden sequence

*g*_{c}, the probability of an observation is

The distribution of the initial state

*S*_{0} in each extended gene region is,

*P* (

*S*_{0} =

*i*)=

*α*_{j},

*i*=1, …,

*K*. Then the likelihood function (joint distribution) is

The second probabilistic model is based on a multivariate instance of a CRF model

^{[18]}^{,}^{[19]} that is a sequence version of logistic regression. This kind of discriminative model is based on the conditional distribution of output hidden variables given input observations. They don't need to model

*p*(

*x*). This character allows the CRF models to be applied to a wide range of applications. In our application, the CRF model was defined as

In equation (3),

*Z*(

*g*_{c}), a normalization function, was defined as

. Factor

*ψ*(

*y*_{c},

*g*_{c}) integrates the transition potential function

between state

and state

and local evidence function

:

where I is indicator function. In our study,

was treated as parameterized:

The feature function in a CRF model may depend on observations from any time point, thus making

very flexible. For this study, we chose the following feature function:

where

This definition links the current hidden state with the three nearest observations (), which allowed us to incorporate context information to find chromatin states. From this point of view, the CRF model is more powerful than the HMM.

Model learning

For the HMM, we used the Baum-Welch algorithm to maximize its likelihood function. Because the local algorithm is expectation-maximization (EM), the result of HMM largely depends on the initial values of the parameters. To reduce the effect of the local maximum, we used the fuzzy c-means (FCM) method to get initial values of the parameters {*α*^{0},*b*^{0},*p*^{0}}. Then a local optimum of the parameters was found using the standard EM based Baum-Welch algorithm.

For the CRF model, we used an iterative learning method to infer the model parameters {*θ*_{ij}, *i*,*j* =1, *k*, *p*_{k}_{,m}, *k*=1,…,*K*, *m*=1,…,*M*} and *y*_{c}, *c*=1,…,*N*:

Step 1: Initialize the parameters of the CRF model using FCM, and use this model to infer hidden states *y*_{c}, *c*=1,…,*N*;

Step 2: Train the model and update parameters using inferred hidden states;

Step 3: Infer, *y*_{c}, *c* = 1, …, *N* using new model; and

Step 4: Repeat step 2 until convergence.

Model assessment

First, we used trained HMM and CRF models to predict chromatin states of extended gene regions on test data. We are going to use a simple cross-validation to get a preliminary assessment of the stability of these two models.

Let

*f*_{ij} be the frequency of a chromatin mark associated with each state, defined as

After training and prediction, we obtained two frequency matrices *F*_{train} and *F*_{test}. For the CRF model or HMM, the generalization performance was measured by

Bias =|F_{train}/100–F_{test}/100|_{2}.

Second, we used external gene annotation data to validate the inferred results from our models. We started by using transcription start site data. Gene annotations were the Refseq annotations^{[20]} obtained from the University of California Santa Cruz (UCSC) genome browser. After determining the transcription start site from known genes, we extended them on both sides to get ±*2k* (*k* = 2,000) transcription start site regions that may contain many regulatory elements, especially promoters and enhancers. This true may help us to recognize histone modification combinations that characterize promoters and enhancers. Further, after using CRF or HMM, hidden states in transcription start site regions are figured out. If these states occur in other regions, such as introns, sites corresponding to these states are putative enhancers. We also used DNase I hypersensitivity data produced by DNase-chip^{[21]}. This Encyclopedia of DNA Elements (ENCODE) data describe areas hypersensitive to DNase as determined by assay in a large collection of cell types. Regulatory regions in general and promoters in particular tend to be DNase sensitive. Instead of using the whole set of data, we chose DNase hypersensitive areas whose scores exceeded the mean score of all DNase hypersensitive areas. Similarly, we detected some states to be enriched in transcription start site regions. These states may be putative enhancers or promoters.

For enrichment, we used following definition. Let

*P*_{Intersect} be the sum of the posterior probabilities of a state in intervals intersecting the external data source and

*P*_{Whole} be the sum of the posterior probabilities of this state over all 200-bp intervals. Then, state enrichment (

*E*) was defined as

Further, we studied the relative position of these states enriched in transcription start sites regions. Some of them concentrated in the upstream, and other was in the downstream. An index called “fold percentage” was defined to demonstrate this character. Let

*l*_{1} be the frequency of a state present in a specific 200-bp interval of a transcription start site regions,

*l*_{2} be the frequency of a state present in all intervals of transcription start site regions,

*l*_{3} be the number of 200-bp intervals in transcription start site regions, and

*l*_{4} be the total number of 200-bp intervals in all gene regions. Then, fold percentage was defined as

High “fold percentage” means high concentration at corresponding position. We plotted the curve of fold percentage in transcription start site regions and found the position of a state with respect to the transcription start site.