Interactions between proteins and DNA facilitate and regulate many basic cellular functions, including transcription, DNA replication, recombination, and DNA repair. For example, the process of transcription is regulated by a class of proteins referred to as transcription factors, which often bind to specific DNA sequences upstream of gene coding regions. This control mechanism allows cells to respond to developmental or environmental signals by using the same transcription factor to coordinate expression of many genes. Therefore, it is of interest to determine where regulatory proteins of this and other types are bound to the genome.
The genomic binding locations of transcription factors can be determined using chromatin immunoprecipitation (ChIP) followed by detection of the enriched fragments by DNA microarray hybridization. This procedure, also known as ChIP-chip, has been reviewed extensively [1]. To appreciate the unique properties of the data generated by the ChIP-chip procedure, it is useful to review briefly the main points of the experimental procedure (Figure 1).
Figure 1 A summary of the ChIP-chip procedure. See the text for details.
After growing the cells of interest under the desired conditions, chromatin is usually cross-linked with formaldehyde to preserve sites of interaction between proteins and DNA. The cross-linked chromatin is then sheared by sonication or enzymatic digestion. Shearing creates a population of chromatin fragments of varying size, generally ranging from 200 to 1,000 base-pairs. The protein of interest, along with the DNA associated with it, is then isolated by using an antibody specific to that protein or by affinity purification utilizing an epitope or affinity tag fused to the protein. The ChIPed DNA is then purified. Because yields from most samples are low, amplification is often required. DNA fragments enriched in the procedure are then detected by comparative hybridization to a DNA microarray. Standard technical recommendations common to all microarray experiments (for example, the need for dye swaps) apply equally to ChIP-chip experiments. The result of the hybridization allows one to identify which segments of the genome were bound by the protein of interest during immunoprecipitation.
The interpretation of data generated by a ChIP-chip experiment is in many respects similar to the interpretation of traditional gene expression microarrays, but it differs in two important ways. First, in traditional expression experiments, each element on the microarray measures the abundance of RNA molecules of a fixed length. (Note that we shall use the term 'arrayed elements' hereafter to describe DNA fragments that are deposited on the surface of the array; the term 'probe' is sometimes used by others.) In contrast, with ChIP-chip experiments each element measures the abundance of a population of fragments of various lengths, a consequence of chromatin shearing. As a result, arrayed elements representing genomic regions both at the binding site and near the binding site will detect enrichment (Figure 2).
Figure 2 The neighbor effect and calculation of P values. (a) After ChIP, purified DNA fragments bound by the protein of interest will be of various lengths. (b) Actual log2 ratios reported by arrayed elements for Rap1p binding to the promoter region of RPL1B (array ...)
Depending on the method and degree of chromatin shearing, and the resolution of the arrayed elements, this effect produces a 'peak' of signal centered over the binding site, which may span several arrayed elements representing genomically adjacent DNA. This 'neighbor effect' is not an expected property of noise or other spuriously high ratio measurements, and thus is a source of information that can be used for analysis.
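The shape of such a peak can be illustrated with a small Python sketch. It is a toy model, not part of the published method, and it rests on assumptions that are not in the text: fragment lengths are taken to be uniformly distributed between 200 and 1,000 base-pairs, the binding site is taken to fall at a uniformly random position within each fragment, and each arrayed element is treated as a single point. The function name `coverage_probability` is hypothetical.

```python
def coverage_probability(distance, min_len=200, max_len=1000, step=50):
    """Fraction of site-containing fragments that also cover a point
    `distance` bp away from the binding site.

    Assumed toy model: for a fragment of a given length whose binding site
    lies at a uniformly random position, the chance of reaching a point
    d bp away is max(0, (length - d) / length). The result is averaged
    over an evenly spaced grid of fragment lengths.
    """
    d = abs(distance)
    lengths = range(min_len, max_len + 1, step)
    probs = [max(0.0, (length - d) / length) for length in lengths]
    return sum(probs) / len(probs)


# Expected relative signal at points on either side of the binding site.
offsets = range(-1000, 1001, 250)
profile = {offset: coverage_probability(offset) for offset in offsets}
for offset, p in profile.items():
    print(f"{offset:+5d} bp  {p:.3f}")
```

Under these assumptions the expected signal is highest at the binding site and falls away symmetrically to zero at the maximum fragment length, which is the peak shape that neighboring arrayed elements would sample.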
The second difference in the interpretation of ChIP-chip and traditional gene expression data is that in expression experiments, the data are two-tailed and roughly symmetric. That is, there is biological significance associated with both low and high ratio measurements, and these measurements often occur with similar frequencies. In contrast, the measurements derived from ChIP-chip experiments arise as a mixture of two distributions. The first corresponds to the population of genomic fragments specifically enriched by the ChIP, and the second corresponds to the remaining population of genomic DNA that is not ChIP enriched and therefore represents background, or noise. The observed distribution of the log2 ratios is therefore asymmetric about zero, with a distinct positive skew (Figure 3). The left-hand side of the distribution (the negative log ratios) is approximately Gaussian, but the positive log ratios exhibit a heavier, non-Gaussian tail. For the vast majority of ChIP-chip experiments, the genomic regions of biological interest will be confined to the positive side of the distribution, and the negative log ratios will arise solely from fragments that are considered to be background. Under the additional assumption that the distribution of unenriched fragments is symmetric about zero, we can estimate the distribution of background ratios using only the observed negative log ratios as a guide [6].
Figure 3 Characteristics of ChIP-chip data. (a) A quantile-quantile plot (QQ plot) for one representative Rap1p ChIP-chip experiment (red) against a Gaussian distribution with a standard deviation of 0.35 and a mean of 0 (black bars). The upper and lower bounds ...
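The idea of estimating the background from the negative log ratios alone can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published implementation: the background is assumed to be Gaussian with mean zero, so reflecting the negative values about zero reduces to taking their root mean square. The function name `background_sigma` and the simulated data are hypothetical; the background standard deviation of 0.35 is borrowed from the Figure 3 caption purely for illustration, and the parameters of the enriched population are invented.

```python
import math
import random


def background_sigma(log_ratios):
    """Estimate the background standard deviation from negative log ratios.

    Assumes the background is symmetric about zero, so mirroring the
    negative values gives a zero-mean sample whose variance is the mean
    of the squared negative values.
    """
    negatives = [x for x in log_ratios if x < 0]
    if not negatives:
        raise ValueError("no negative log ratios to estimate background from")
    return math.sqrt(sum(x * x for x in negatives) / len(negatives))


# Simulated mixture (illustrative values only): mostly background,
# plus a small enriched population that forms the heavy right-hand tail.
random.seed(0)
background = [random.gauss(0.0, 0.35) for _ in range(5000)]
enriched = [random.gauss(1.5, 0.5) for _ in range(250)]
log_ratios = background + enriched

sigma_hat = background_sigma(log_ratios)
# Naive zero-mean estimate using every value, inflated by the enriched tail.
naive_sigma = math.sqrt(sum(x * x for x in log_ratios) / len(log_ratios))
print(f"negative-side estimate: {sigma_hat:.3f}")
print(f"all-data estimate:      {naive_sigma:.3f}")
```

In this simulation the negative-side estimate stays close to the simulated background value, whereas the estimate that uses all of the data is inflated by the enriched fragments.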
The type of microarray used in a ChIP-chip experiment affects how the data can be analyzed. Two array designs are typically used for ChIP-chip: tiled and promoter-specific arrays. Promoter-specific arrays generally contain a single arrayed element to represent each regulatory region of interest. These arrays are valuable when binding is known to be confined to regulatory sequences close to transcriptional start sites of the selected genes [7], but they become less powerful when binding is not as well characterized or is spread over a large genomic area. The other type, the tiled array, is best suited to ChIP-chip. The term 'tiled array', or sometimes 'tiling-path array', refers to arrays containing DNA fragments designed to cover large genomic regions or whole chromosomes with few or no gaps between arrayed elements [8]. Tiled arrays are advantageous because they do not require prior knowledge of potential binding targets, and they allow one to utilize the 'neighbor effect' in data analysis.
In this report we describe ChIPOTle (Chromatin ImmunoPrecipitation On Tiled arrays), software created expressly for the analysis of ChIP-chip data obtained using tiled arrays, which allows us to exploit both the 'single-tail' and 'neighbor effect' properties of the data. ChIPOTle uses a sliding window approach to identify potential sites of enrichment, and then estimates the significance of enrichment for a genomic region using a standard Gaussian error function. ChIPOTle is delivered as a Microsoft Excel macro written in Visual Basic, which should facilitate widespread adoption and provide a platform for custom applications. Before ChIPOTle, to our knowledge the only publicly available program designed expressly for ChIP-chip data analysis was PeakFinder [10]. ChIPOTle offers several improvements, including accurate and powerful P value estimation and improved usability. ChIPOTle is available online (Additional data file 1) [11].
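The general idea of combining a sliding window with a Gaussian error function can be sketched in Python. This is not ChIPOTle's code (which is a Visual Basic macro) and should not be read as its exact algorithm. The sketch assumes that windows are defined by a fixed number of adjacent arrayed elements, that background log2 ratios are independent and Gaussian with mean zero, and that the background standard deviation `sigma` is already known. The function names and the example values are hypothetical.

```python
import math


def window_means(log_ratios, window):
    """Mean log2 ratio in every run of `window` adjacent arrayed elements."""
    return [sum(log_ratios[i:i + window]) / window
            for i in range(len(log_ratios) - window + 1)]


def upper_tail_p(mean, sigma, window):
    """One-sided P value for a window mean under the assumed background.

    If each element is N(0, sigma^2) and elements are independent, the
    mean of `window` elements is N(0, sigma^2 / window). The upper-tail
    probability is computed with the complementary error function.
    """
    z = mean * math.sqrt(window) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Illustrative log2 ratios along a chromosome: a three-element peak
# (indices 3 to 5) and an isolated high value (index 8).
log_ratios = [0.1, -0.2, 0.0, 0.9, 1.4, 1.1, 0.1, -0.3, 1.2, -0.1, 0.2]
sigma = 0.35   # assumed background standard deviation
window = 3     # assumed window size, in arrayed elements

means = window_means(log_ratios, window)
p_values = [upper_tail_p(m, sigma, window) for m in means]
best = min(range(len(means)), key=lambda i: p_values[i])
print(f"most significant window starts at element {best}, P = {p_values[best]:.2e}")
```

Because the window averages over neighbors, the three-element peak yields a far smaller P value than any window containing the isolated high value, which is how the neighbor effect helps to separate binding sites from spurious measurements. The P values in this sketch are not corrected for multiple testing.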