Although chromatin immunoprecipitation (ChIP) was first adapted for use with mammalian cells <10 years ago (1), it is now the gold-standard experiment for identifying the target genes of a particular transcription factor. Recent advances allow investigators to use the ChIP assay to identify and characterize the entire set of binding sites for a given factor. Such large-scale studies of transcription factor binding began with promoter-specific microarrays, a technique called ChIP-chip (3–7). However, many binding sites will be completely missed on such arrays because some factors localize mainly to regions outside of the tiled core promoters (8). ChIP-chip has now been extended to the entire human genome using a series of microarrays that contain oligonucleotides spaced ~35–100 nt apart (10). This gapped spacing is necessary because of the large number of arrays (and thus the high cost) that would be required if overlapping oligomers were used. However, the gapped spacing makes genome-scale ChIP-chip experiments less precise in mapping the exact location of a binding site than they would be with overlapping oligomers. The latest development, ChIP-seq, in which the immunoprecipitated sample is used to create a library that is analyzed on high-throughput next-generation sequencers, also provides genome-scale analysis of binding sites (12–15). Because ChIP-seq is not limited to a specific tiled region but can sample the entire genome, this technique can provide a very precise mapping of a peak location (16). A comparison of an E2F4 binding site identified in the GMNN promoter using both ChIP-chip and ChIP-seq is shown in Supplementary Figure S1. Although both technologies correctly identify the GMNN promoter as a target for E2F4, ChIP-seq provides a more accurate location of the binding site. Because ChIP-seq provides a genome-scale analysis that is less costly than genome-wide ChIP-chip and allows more precise mapping of binding site locations, most investigators are moving to this technology as the method of choice for identifying transcription factor binding sites. However, as described below, ChIP-seq, like any other technology, has issues that must be considered.
The first step in the analysis of ChIP-seq data is to identify all sequenced tags that map uniquely to the genome of interest. For many ChIP-seq experiments, investigators analyze very short reads (e.g. 27 nt). This short read length can sometimes result in a sequenced tag mapping to more than one place in the genome. If this occurs, the tag is discarded and not included in peak analyses. In most cases, this is not a problem because the region surrounding the ‘non-unique’ tag contains many unique 27-mers and a peak can still be identified. However, if a peak lies within a large region that is not unique within the genome, it will be completely missed (i.e. it will be a false negative). This is especially problematic for genes that have been duplicated over evolutionary time and thus have several identical (or almost identical) copies that reside in different genomic locations (Supplementary Figure S2). False negatives can also arise in ChIP-seq analyses because of the effects of chromatin structure on the fragmentation step. Investigators use either sonication or micrococcal nuclease digestion to fragment the chromatin before using it in a ChIP assay. However, the sonication and/or digestion step does not always provide a representative population of fragments in the right size range; this is especially problematic for heterochromatic regions. These regions will be underrepresented in the sequencing library, which can adversely affect peak identification by ChIP-seq (Supplementary Figure S3). Conversely, just as heterochromatic regions are lost during sample preparation, promoter regions are sometimes artificially enriched. Promoters appear to be more easily fragmented into small chromatin fragments than other regions of the genome and often show up as a small peak in an input sample (17). However, proper analysis using appropriate input libraries can improve the accuracy of binding site identification (see below for more details). Another problem that must be considered when analyzing ChIP-seq data is that certain regions always appear as peaks in a given cell type, independent of the factor being tested (Supplementary Figure S4). These false positives can be due to repetitive regions being mis-annotated as unique. This is especially problematic when studying cancer cell lines and tissues, which have many amplified genomic regions. It is critical that these false positives are removed from the set of called peaks. As described below, we have addressed many of these problems by identifying binding sites as regions that are significant over background, independent of sequence density. We present a software package, called Sole-Search, to analyze ChIP-seq data and determine statistically significant peaks, with minimal false positives and false negatives. We demonstrate the utility of our software by collecting, analyzing and comparing ChIP-seq data for six different human transcription factor/cell line combinations: E2F4, E2F6 and YY1 in K562 cells; YY1 in Ntera2 cells; TCF7L2 (called in this article by its other name, TCF4) in HCT116 cells; and TFAP2A (called in this article by its other name, AP2α) in HeLa cells. The analyses of these datasets are provided in Supplementary Data S1–S21, whereas the sgr visualization files and sequenced tag files are available on the UCSC browser.
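As an illustration of the first analysis step described above, in which tags that map to more than one genomic location are discarded before peak calling, a minimal Python sketch might look like the following. It is not the Sole-Search implementation; the tuple layout of an alignment record, the example coordinates and the function name `keep_unique_tags` are assumptions made purely for this example.

```python
from collections import Counter


def keep_unique_tags(alignments):
    """Keep only alignments whose tag maps to exactly one genomic location.

    `alignments` is an iterable of (tag_id, chrom, start, strand) tuples,
    one tuple per location reported for a tag. This record layout is a
    hypothetical format chosen for illustration.
    """
    alignments = list(alignments)
    # Count how many genomic locations were reported for each tag.
    hits_per_tag = Counter(tag_id for tag_id, *_ in alignments)
    # Discard every alignment of a tag that maps to more than one place.
    return [a for a in alignments if hits_per_tag[a[0]] == 1]


# Made-up example: tag2 maps to two locations and is therefore discarded.
alignments = [
    ("tag1", "chr6", 1000, "+"),
    ("tag2", "chr1", 5000, "+"),
    ("tag2", "chr5", 7000, "-"),
    ("tag3", "chr6", 1020, "-"),
]
unique = keep_unique_tags(alignments)
```

In this toy input, the two surviving tags still lie close together on the same chromosome, which mirrors the point made above: a peak can usually still be identified when the surrounding region contains enough unique sequence, whereas a peak lying entirely within duplicated sequence would lose all of its tags.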