Chromatin (Ch) Immunoprecipitation (IP) microarray (ChIP-Chip) assays use microarrays of DNA sequences to measure specific DNA-protein interactions, with a goal of discovering the genomic locations of transcription factor binding sites (TFBSs). Transcription factors (TFs) are proteins that regulate the expression of nearby genes by binding to DNA. TFs bind to TFBSs that are usually on the order of 10–20 nucleotides in length, and even in relatively small genomes, the binding sites occur in hundreds of locations (
Buck and Lieb, 2004). However, motif discovery methods cannot be directly applied to the genome in order to find TFBSs because of the number of false positive binding site matches that would result. ChIP-chip data allows one to narrow the search from the whole genome to only those regions where binding has likely occurred
in vivo.
In chromatin immunoprecipitation experiments, the TF of interest binds to the DNA
in vivo under controlled conditions, and the protein-DNA complexes are fixed or crosslinked and extracted. The DNA is sheared into approximately 1kb fragments by sonication. Next, an antibody specific to the TF of interest selectively binds to the protein-DNA complexes of interest, and this entire complex precipitates out of solution. The DNA precipitate is then extracted, the crosslinks are reversed, it is universally amplified, and fluorescently labeled. This
IP sample is enriched for DNA fragments that contained a binding site. Reference samples of the input DNA fragments that do not go through the IP process are used as controls, and either two-color microarrays (
Buck and Lieb, 2004) or high density oligonucleotide arrays (
Kapranov et al., 2002;
Cawley et al., 2004) compare the DNA present in the IP and the reference sample at each DNA segment that has a corresponding probe. If a probe or continuous region of many probes has higher intensity in the IP sample than the reference, it is said to be relatively
enriched. The sequences of the enriched regions are often searched for the presence of TFBSs.
TFBSs generally do not match an exact sequence, and but are usually represented by a 4×
w position specific weight matrix (PSWM) Θ where the four rows represent the nucleotides A, C, G and T and the
w columns represent the
w motif positions (
Liu et al., 1995). The element Θ
ij is the probability that the nucleotide at position
j of the sequence is
i, i ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
{
A,C,G,T}. Searching for patterns of several base pairs within segments of DNA that might be several thousand base pairs long can lead to many false positive matches because there are thousands of potentially similar sites within a single DNA segment. This multiplicity greatly increases the computational burden, especially if many DNA segments are considered simultaneously. Further, the “background” DNA sequence that does not contain binding sites generally has a highly non-random distribution of nucleotides. The computational and statistical challenges of motif discovery have led to the development of a number of statistical model-based methods for motif discovery (
Liu et al., 1995;
Gupta and Liu, 2003;
Zhou and Liu, 2004) as well as computationally fast and partially heuristic methods (
Liu et al., 2002;
Buhler and Tompa, 2002).
The analysis of ChIP-chip data is typically done through a two step approach. The first step deals with the ChIP-chip array data, and it analyzes the probe intensities to find the regions of enrichment. The second step uses the genomic sequences of the regions found in the first step to estimate the motif. Next, we describe the main features and form of ChIP-chip data, and discuss currently available methods for its analysis.
1.1 Data structure
There are two main technologies used for ChIP-chip experiments: (i) a two-color system in which the IP sample is labeled with one fluorescent dye and the reference sample is labeled with a different dye and applied to the same array, and (ii) the oligonucleotide (e.g. Affymetrix) array, in which the IP sample is applied to one set of arrays, and the reference sample is applied to a different array set. In this article, we focus on two-color ChIP-chip data. The probes on the two-color arrays range from about 100 to 2,000 base pairs in length. For each probe
p on each array, there are two measurements: one for the IP sample intensity IP
p and one for the reference sample intensity Ref
p. The variation due to the random error of a specific probe’s measurement is reduced by taking the ratio of IP
p/Ref
p which removes the multiplicative effect of probe
p that is common to both IP
p and Ref
p (
Rocke and Durbin, 2001). Enrichment implies that log(IP
p/Ref
p) > 0 for a given probe
p.
The full ChIP-chip experiment can be represented as an
P ×
R matrix
Y where microarray replicates are indexed
r ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[1 …
R], and the probes are indexed by
p ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[1 …
P]. A row of this matrix which contains all measurements from a probe is denoted as
yp. The number of probes
P ranges from 10,000 to 1,000,000 in different experiments, and the number of replicates
R is small, usually between 1 and 10. The
rth element of
yp is denoted as
ypr and is the log-ratio of the IP sample intensity and the reference sample intensity, that is,
ypr = log(IP
pr/Ref
pr). A schematic of the data is shown in . The values of
ypr that are higher are more likely to be IP enriched. The histogram of average values of
yp () from a yeast RAP1 experiment (
Lieb et al., 2001) shows that the averages can be thought of as a mixture of the enriched and the not enriched probes. The sequence that corresponds to probe
p will be denoted as
xp. The consecutive probes are adjacent on the genome, and the fragments which hybridize to the probes correspond to the complementary sequence.
xp is a sequence of A’s, C’s, G’s, and T’s with length
Kp. A subsequence of
xp from position
j to position
k will be denoted as
xp[
j : k]. The probe sequences can range from a few hundred to several thousand base pairs in length, but the resolution of each probe is limited by the size of the applied DNA fragments. ChIP-chip analysis should also consider the spatial correlation
between probes that represent adjacent loci. Probes are correlated if the genomic distance between the probes is less than the length of the DNA fragments in the sample. Correlation between adjacent probes is a prominent feature of the data because the DNA fragments applied to the arrays may span two or more probes (
Buck and Lieb, 2004).
1.2 Current methods for analyzing ChIP-chip data
Another approach to finding regions of enrichment is to use Hidden Markov Models (HMMs).
Ji and Wong (2005) developed a nonparametric method called Unbalanced Mixture Subtraction (UMS) to estimate the emission densities and the FDR within an HMM.
Li et al. (2005) proposed an HMM with the same state space, but used normal distribution models for the emission densities.
Li et al. (2005) and
Ji and Wong (2005) both demonstrate the superior performance of the HMM method over moving average models in terms of power for detecting IP enrichment for small sample sizes. One limitation with both of these methods is that the transition probabilities and the emission densities are not estimated simultaneously by the HMM so that the user must select values which may lead to suboptimal performance.
Keles (2007) proposed a hierarchical mixture model for detecting regions of IP enrichment, that considers regions of adjacent probes collectively in order to take advantage of the correlation between probes. A similar robust hierarchical method was proposed by
Gottardo et al. (2006), allowing for probe specific differences in error variance, and using a normalization procedure called Model-based Analysis of Tiling-array (MAT) for oligonucleotide ChIP-chip (
Johnson et al., 2006). Normalization methods for two color ChIP-chip data are inherently difficult, due to the skewed nature of ChIP-chip log ratios
Buck and Lieb (2004). Recently,
Zheng et al. (2007) advanced a model for the estimating for the shape of the probe intensity enrichment peaks. This approach has the advantage adaptively pooling information from adjacent probes based upon data-driven peak shape estimates.
In most two step procedures, the first step estimates the probe’s enrichment probability based on the intensity information. Probes that exceed some cutoff based upon the intensity model are submitted to a second step that estimates the probes’ binding site probability based on the probe sequence. However, the probe intensity and sequence are not independent–that is, probes with with or near TFBSs have a higher likelihood of being enriched. Using a intensity based cutoff in the first step that does not consider the probe’s surrounding sequence is likely to miss some probes with TFBSs. Ignoring the probe level information when considering the sequence could also lead to biases, as the probe measurement may be informative towards the extent of actual binding. It thus seems reasonable to consider a joint model of the ChIP-chip measurements and sequence for more accurate estimation.
Shim and Keles (2007) propose a procedure that analyzes the genome sequence conditionally upon the probe intensity information. Their method uses the measurements from the probes within a conditional two-component mixture model. The model estimates the probability of a TFBS occurrence at a particular base pair in the genomic sequence conditionally upon a smoothed average of the probe level intensity statistics. The relationship between the ChIP-chip value of probe
i, Ti, and the indicator variable,
Zi, denoting whether a motif starts at position
i, is modeled as logit[
Pr(
Zi = 1|
Ti)] = β
0 + β
1Ti. Averaging of
Ti’s between probes only approximately accounts for spatial correlations, unlike an HMM. A more debatable assumption is that the ChIP-chip value,
Ti is taken to be without error despite the uncertainty inherent in estimation. The estimation procedure appears likely to suffer from limitations of the EM-based algorithm, for example multimodality traps, as well as the inability to capture multiple binding sites close together.
Shim and Keles (2007) use their technique as a refinement procedure after the bound regions have already been selected, rather than allowing the sequence model to inform the probe measurement model.
In this article, we propose a joint model for ChIP-chip and sequence data for TFBS discovery, accounting for the uncertainty inherent in both steps. The probe level information, followed by a motif discovery step, is used to generate initial motif candidates under the assumption that the “best” binding sites are more likely to be present close to the highly enriched probes. However, once the initial estimates are obtained, the entire set of data is fit using a novel joint model (Section 2). As we see later in data analyses (Section 4), our framework succeeds in finding many TFBSs that are missed by two-step procedures. The motivation for the joint model is to minimize the number of probes close to binding sites, that fail the intensity cutoff. We use the probe level information to allow (i) probe level experimental/measurement errors not to bias the observations in case a functional binding site is truly present, but observed binding is not significant enough, and (ii) the intensity of binding to have an influence on our assessment of the strength of a binding site. There could be minor errors due to some regions containing motif matches that show no high intensities, but the joint model should not allow too many of these regions to bias the analysis. In addition, our model may be easily extended to more than one motif of interest.