Chromatin immunoprecipitation (ChIP) followed by genomic tiling microarray hybridization (ChIP-chip) or massively parallel sequencing (ChIP-seq) are two of the most widely used approaches for genome-wide identification and characterization of in vivo protein-DNA interactions. They can be used to analyze many important DNA-interacting proteins, including RNA polymerases, transcription factors, transcriptional co-factors, and histone proteins [1]. Indeed, these genome-wide ChIP analysis approaches have led to many important discoveries related to transcriptional regulation [2], epigenetic regulation through histone modification [5], nucleosome organization [6], and interindividual variation in protein-DNA interactions [8].
ChIP-chip first appeared in the literature about 10 years ago and was one of the earliest approaches to genome-wide mapping of protein-DNA interactions in organisms with small genomes, such as yeast [2]. Currently, tiling microarray platforms for common model organisms are well supported by commercial vendors, and many bioinformatics tools have been developed for ChIP-chip analysis [11]. Fueled by the rapid development of second-generation high-throughput sequencing technologies in the past few years, ChIP-seq has emerged as an attractive alternative to ChIP-chip [1]. For instance, ChIP-seq generally produces profiles with higher spatial resolution, dynamic range, and genomic coverage, giving it higher sensitivity and specificity than ChIP-chip in protein binding site identification. Further, ChIP-seq can be used to analyze virtually any species with a sequenced genome, since it is not constrained by the availability of an organism-specific microarray. Many current ChIP-seq protocols also require a smaller amount of starting material than ChIP-chip [15]. Moreover, ChIP-seq is already a more cost-effective way of analyzing mammalian genomes, and this advantage will likely become more apparent as the cost of high-throughput sequencing continues to drop. These factors have led to the rapid adoption of ChIP-seq technology.
However, despite the widespread use of both ChIP-chip and ChIP-seq, only a few small-scale studies have attempted to quantitatively compare these technologies using real data. Euskirchen et al. [17] compared the STAT1 binding sites identified by ChIP-chip and ChIP-PET (paired-end ditag sequencing by Sanger sequencing technology) and found good overall agreement between the two technologies, particularly in identifying highly ranked enrichment regions. They nonetheless noted specific discrepancies in regions associated with repetitive elements, which can be attributed to lack of microarray probe coverage or misalignment of ChIP-PET reads. More recently, a number of studies compared genome-wide transcription factor binding datasets generated by ChIP-chip with those generated by ChIP-seq [18] (see Additional file 1: Table S1). The general conclusions from these studies were that binding profiles generated by ChIP-chip and ChIP-seq were largely correlated at the genome-wide level, and that ChIP-seq had higher sensitivity and specificity than ChIP-chip in binding site identification, as determined by motif enrichment or quantitative PCR validation. It was also found that the strongest peaks were more likely to be detected by both technologies. However, only a few pairs of ChIP-chip/ChIP-seq profiles were analyzed in these studies, and their focus was mainly on the ability to identify narrow enrichment regions using specific peak calling algorithms. As shown previously [23] and in this study, peak identification can depend strongly on the analysis algorithm, so more general comparison metrics that do not rely on a particular peak caller should also be used.
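As one illustration of a comparison metric that does not depend on a peak caller, the sketch below averages two signal profiles into fixed-size bins and computes their Pearson correlation. It is a minimal example under stated assumptions, not the procedure used in this study: the function names `bin_signal` and `pearson`, the bin size, and the toy signal values are all hypothetical, and both profiles are assumed to be already aligned to the same genomic coordinates.

```python
import math


def bin_signal(values, bin_size):
    """Average consecutive values into fixed-size bins.

    A trailing partial bin is dropped so every bin covers bin_size positions.
    """
    n_bins = len(values) // bin_size
    return [
        sum(values[i * bin_size:(i + 1) * bin_size]) / bin_size
        for i in range(n_bins)
    ]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Toy per-position enrichment values (hypothetical, for illustration only).
chip_chip_signal = [0.1, 0.3, 2.0, 2.4, 0.2, 0.0, 1.1, 0.9]
chip_seq_signal = [0.0, 0.2, 3.1, 3.5, 0.1, 0.1, 1.6, 1.2]

r = pearson(bin_signal(chip_chip_signal, 2), bin_signal(chip_seq_signal, 2))
print(f"bin-level Pearson correlation: {r:.3f}")
```

Because the correlation is computed over all bins rather than over called peaks, it reflects agreement in both enriched and background regions. A rank-based coefficient could be substituted when the two platforms differ strongly in dynamic range.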
In addition, little is known about technology-specific variation in the analysis of histone modification data. ChIP-based histone modification data are commonly used to reconstruct average signal profiles, or "epigenetic signatures," of key genomic regions such as transcription start and end sites, but the impact of using ChIP-chip versus ChIP-seq data for constructing epigenetic signatures is largely unknown. Furthermore, it is important to understand the technology-specific biases associated with high-throughput sequencing. Recent studies indicated that the distribution of cross-linked and sonicated DNA fragments (input DNA) is affected by chromatin structure, copy number variation, occurrence of genomic repeats, mappability, genomic location, gene expression activity, and genomic GC content variation [24]. Since input DNA is commonly used as a background control for a ChIP-seq experiment, it is important to assess how such variation affects the analysis of ChIP-seq data.
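The average signal profile ("epigenetic signature") mentioned above can be sketched as follows: windows of signal centered on a set of anchor positions, such as transcription start sites, are oriented by strand and averaged position by position. This is a minimal sketch under stated assumptions rather than the method used in this study: the function name `average_profile`, the per-position list representation of a single chromosome, and the choice to skip anchors whose window runs off the chromosome end are all hypothetical.

```python
def average_profile(signal, anchors, flank):
    """Average the signal in windows of +/- flank positions around anchors.

    signal  : list of per-position values for one chromosome
    anchors : list of (position, strand) tuples, strand being "+" or "-"
    flank   : number of positions to include on each side of the anchor

    Windows for "-" strand anchors are reversed so that index 0 is always
    the most upstream position. Anchors too close to a chromosome end to
    fill a full window are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for position, strand in anchors:
        start = position - flank
        end = position + flank + 1
        if start < 0 or end > len(signal):
            continue  # incomplete window, skip this anchor
        window = signal[start:end]
        if strand == "-":
            window = window[::-1]
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    if used == 0:
        raise ValueError("no anchor had a complete window")
    return [total / used for total in totals]


# Toy example: one "+" anchor and one "-" anchor (hypothetical values).
toy_signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_anchors = [(2, "+"), (7, "-")]
profile = average_profile(toy_signal, toy_anchors, flank=1)
print(profile)
```

Applying the same function to a ChIP-chip profile and a ChIP-seq profile of the same sample yields two signatures that can be compared directly. Input-DNA biases of the kind listed above would need to be handled in a separate normalization step that is not shown here.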
Therefore, a thorough understanding of the technological variation between ChIP-chip and ChIP-seq is important for experimental design and data analysis. In this study, we compiled and analyzed 31 pairs of ChIP-chip/ChIP-seq profiles of technical replicates across eight immunoprecipitation (IP) factors (CBP, RNA PolII, and six histone modifications) at four developmental stages of the common fruit fly Drosophila melanogaster (Table ) as part of the model organism Encyclopedia of DNA Elements (modENCODE) project [27]. In addition, our compiled dataset comprises another 62 ChIP-chip profiles (biological replicates) in the same set of biological conditions (i.e., three ChIP-chip biological replicates at each developmental stage/IP combination), nine sequencing profiles of input DNA, and four pairs of ChIP-seq/ChIP-seq replicates (Table ). Agilent's tiling microarray (Agilent custom 3X244K Dmel Whole Genome Tiling Microarray) and Illumina's Genome Analyzer II platforms were used to generate the ChIP-chip and ChIP-seq data, respectively. All data used in this study were generated as part of the modENCODE project and are accessible from NCBI GEO (accession numbers: GSE15292, GSE16013, and GSE20000). The goal of this study was to quantify reproducibility within and between profiles generated using the ChIP-chip and ChIP-seq approaches, and to pinpoint the sources of variation between the technologies, which should ultimately provide useful information for experimental design and data analysis.
Summary of the ChIP-chip and ChIP-seq profiles analyzed in this study
Summary of the additional ChIP-seq profiles analyzed in this study