Chromatin immunoprecipitation, or ‘ChIP’, allows for the capture of the binding events between transcription factors or other DNA binding proteins and their targets in vivo at the moment of biochemical cross-linking. With the development of ‘ChIP-on-chip’ technology, the near genome-wide location analysis of binding sites for transcription factors became a reality (1). While this technology has greatly improved our understanding of transcriptional regulation in mammals, it is limited by the type of microarray platform used for the hybridization, in terms of both spatial resolution and the genomic regions that can be covered (1–3). ChIP-Seq technology addresses these issues, providing sequences for target regions anywhere in the genome with dramatically improved spatial resolution (2–9).
While ChIP-Seq technology offers many advantages over ChIP-on-chip, the large amount of data produced from each run (>1 Gb of sequence) poses a challenge for the accurate identification of transcription factor binding sites (3). Over the last few months, a number of new methods have been released that attempt to address these challenges (8–17). Many initial approaches did not employ control (i.e. input derived) datasets to eliminate falsely called binding regions that occur due to sequencing biases (6). More recent methods enable the user to specify a control data set to eliminate false positive regions that result from these biases (8). For instance, the CisGenome software system uses a conditional binomial model to identify enriched regions when a control data set is provided and includes an option for incorporating sequence strand information (16). MACS (Model-based Analysis of ChIP-Seq) uses the control dataset to model the tag distribution across the genome using the Poisson distribution (λBG). After identifying candidate peaks that are significantly enriched over λBG, a local λ is estimated using windows around each peak to eliminate local biases (14). The PeakSeq algorithm is a two-step process that first identifies regions enriched compared to a null background model, and then returns regions that are statistically significant after taking ‘genome-mappability’ and control data into account (17). QuEST (Quantitative Enrichment of Sequence Tags) employs control data sets to eliminate false positive regions, and also to estimate a false discovery rate (FDR) (13). QuEST first calculates a ‘peak shift’ based on profiles generated from forward and reverse sequence tag reads. Once the shift is estimated, the profiles are combined and peaks are called based on the enrichment of ChIP sequence tags relative to control sequence tags in the same region. The default parameter settings for QuEST are very stringent, yielding very small numbers of targets if the antibody used for ChIP provides only weakly to moderately enriched regions. The SISSRS (Site Identification from Short Sequence Reads) algorithm utilizes sequence strand information to identify binding sites, which helps eliminate false positives. However, this approach may be too stringent for some applications, as only the strongest binding sites will contain sufficient sequence tags to fit the SISSRS model. When ChIP-Seq is used to identify binding regions for a transcription factor that has not been well studied, adjusting parameters without knowing the expected number of binding sites or the affinity of the antibody can be a difficult task.
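To make the Poisson-based enrichment test described above for MACS more concrete, the following is a minimal sketch, not the MACS implementation. It scores a candidate peak against the larger of a genome-wide background rate and a local rate estimated from control tags near the peak. The function names, the `chip_to_control_ratio` scaling parameter, and the choice to take the maximum of the two rates are assumptions made for illustration.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # pmf at 0
    cdf = term
    for i in range(1, k):
        term *= lam / i  # pmf at i from pmf at i - 1
        cdf += term
    return max(0.0, 1.0 - cdf)  # clamp tiny negative rounding errors


def peak_p_value(chip_count, lambda_bg, local_control_counts,
                 chip_to_control_ratio=1.0):
    """Score a candidate peak (hypothetical helper, assumptions as noted).

    chip_count: ChIP tags observed in the candidate window.
    lambda_bg: expected tags per window under the genome-wide background.
    local_control_counts: control tags in same-sized windows around the peak.
    chip_to_control_ratio: scales control counts to the ChIP sequencing depth.
    """
    lambda_local = chip_to_control_ratio * (
        sum(local_control_counts) / len(local_control_counts)
    )
    # Assumption: use the more conservative (larger) of the two rates.
    lam = max(lambda_bg, lambda_local)
    return poisson_sf(chip_count, lam)
```

Under these assumptions, a window with 20 ChIP tags is far less significant when the surrounding control windows average 10 tags than when they average 1 tag, which is the kind of local bias such a correction is meant to absorb.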
Here, we introduce a novel method, termed GLITR (GLobal Identifier of Target Regions), to address some of the important issues with ChIP-Seq analysis. GLITR randomly samples sets of control sequence tags to accurately estimate a fold-change for each region identified in a target dataset. Following fold-change calculation, GLITR uses a classification method that incorporates two values, peak height and fold-change, to identify regions that are enriched at a specified FDR, which is calculated by comparing ChIP classification results to pseudo-ChIP (a sample of control sequence tags) classification results. By combining two attributes of a region, GLITR greatly improves the ability to distinguish signal from noise in ChIP-Seq data. This is important because using peak height alone to identify targets leads to the inclusion of many false positives corresponding to regions that are also sequenced in control samples. Likewise, relying only on fold-change values is problematic, because a high fold-change cutoff eliminates many targets, while a low fold-change cutoff drastically increases the number of false positives, since low fold-changes are common in pseudo-ChIP data. After discussing the importance of using control DNA in ChIP-Seq experiments, we establish that sequencing input DNA from different tissues yields comparable results. We then compare the ability of GLITR to identify binding regions in ChIP-Seq data, obtained from sequencing Foxa2 ChIP material from adult mouse liver, with that of currently published methods. We show that while all methods are able to identify regions that have the strongest Foxa2-binding sites, when moving deeper into a target list only GLITR continues to discover regions with a strong match to the Foxa2 consensus. Additionally, we show that the experimental design used to obtain sufficient sequencing tags greatly influences the regions identified as occupied by a transcription factor.
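The two ideas outlined above, a fold-change estimated from random samples of control tags and an FDR estimated from pseudo-ChIP results, can be illustrated with a short sketch. This is not the GLITR implementation: the function names, the pseudocount of 1, and the default of 100 samples are hypothetical, and a simple pair of thresholds on peak height and fold-change stands in for the classification method, which is not specified in this section.

```python
import random


def count_in_region(tag_positions, start, end):
    """Count tags whose position falls in the half-open interval [start, end)."""
    return sum(1 for p in tag_positions if start <= p < end)


def sampled_fold_change(chip_tags, control_tags, region, n_samples=100, seed=0):
    """Estimate ChIP/control fold-change by repeatedly sampling control tags.

    Each sample is drawn without replacement; control counts are rescaled
    to the ChIP depth when the control set is smaller than the ChIP set.
    """
    if not chip_tags or not control_tags:
        raise ValueError("both tag lists must be non-empty")
    rng = random.Random(seed)
    start, end = region
    chip_count = count_in_region(chip_tags, start, end)
    sample_size = min(len(chip_tags), len(control_tags))
    scale = len(chip_tags) / sample_size
    total = 0
    for _ in range(n_samples):
        sample = rng.sample(control_tags, sample_size)
        total += count_in_region(sample, start, end)
    mean_control = scale * total / n_samples
    # Assumption: a pseudocount of 1 avoids division by zero.
    return chip_count / (mean_control + 1.0)


def empirical_fdr(chip_regions, pseudo_regions, min_height, min_fold):
    """Estimate FDR as (pseudo-ChIP regions called) / (ChIP regions called).

    Each region is a (peak_height, fold_change) pair. The two-threshold
    rule below is a stand-in for a real classifier.
    """
    def n_called(regions):
        return sum(1 for h, f in regions if h >= min_height and f >= min_fold)

    n_chip = n_called(chip_regions)
    return n_called(pseudo_regions) / n_chip if n_chip else 0.0
```

Requiring both thresholds to pass reflects the argument made above: a tall peak that is equally tall in the control fails on fold-change, and a high fold-change built on very few tags fails on peak height.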