|Home | About | Journals | Submit | Contact Us | Français|
H.K.J., conceived the study, developed CisGenome GUI and data analysis algorithms, performed data analyses and drafted the manuscript. H.J., developed CisGenome browser. W.X.M., participated in algorithm development and performed data analyses. D.S.J. and R.M.M., generated NRSF ChIP-chip data. W.H.W., conceived the study and drafted the manuscript. All authors read and revised the manuscript.
CisGenome is a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. It is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate (FDR) computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP-chip analysis methods, the software contains new statistical methods designed specifically for ChIP-seq data. CisGenome has a modular design so that it supports interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We illustrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis without negative control sample, and an analysis of a novel motif in Nanog- and Sox2-binding regions.
Chromatin immunoprecipitation followed by genome tiling array analysis (ChIP-chip)1-3 or by massively parallel sequencing (ChIP-seq)4-10 are recently developed approaches to study genome-wide transcriptional regulation (see Supplementary Fig. 1 online). By systematically identifying protein-DNA interactions of interest, studies using these technologies provide information on cis-regulatory circuitry underlying various cellular processes. The analysis of the massive and heterogeneous datasets from these studies, however, poses several challenges. These include effective data visualization, seamless connection of low-level (close to raw data) and high-level (close to biological questions) analysis tasks, integration of data from multiple technological platforms, and flexibility to customize the analysis to address specific biological questions. Although there are several recently developed programs11-31 that target some of the individual steps, an integrated tool that can satisfy all basic needs in ChIP data analyses is not yet available (see Supplementary Notes online).
We have developed a set of methods to meet these needs in ChIP data analyses and implemented them in the integrated software CisGenome (Fig. 1). CisGenome provides a wide range of functionalities for ChIP data analyses which can be accessed through a menu-driven system in a graphic user interface (GUI), and the results are automatically linked to the CisGenome browser which is designed for data visualization. CisGenome is a standalone system that bench biologists can use to analyze their own data locally on personal computers. At the same time, most CisGenome functionalities can also be accessed in a command line manner. This modular design allows computational biologists to build large batch jobs for customized analyses on computer servers.
CisGenome can detect binding regions (or peaks) from raw tiling array probe intensities or mapped sequence reads. For example, using the GUI one can directly load Affymetrix CEL and BPMAP tiling array data, examine raw array images to detect hybridization artifacts, normalize data across different arrays, and then detect binding regions (see Supplementary Fig. 2a-c online). CisGenome can also take as input the binding regions/peak scores obtained from other preprocessing programs, such as MAT11 for ChIP-chip and QuEST30 for ChIP-seq data. CisGenome uses TileMap12 for internal ChIP-chip peak calling and FDR estimation (see Supplementary Methods online).
The peak signals including fold changes and summary statistics are reported in tables and linked to CisGenome browser. In the browser, one can visualize the probe-/read- level data together with gene structures, conservation scores and DNA sequences (Supplementary Fig. 2d). One can freely zoom in, zoom out, move left and right, search for genes and regions, add and delete annotation tracks. By clicking a location of interest, one can link to external resources such as NCBI32, UCSC33 and Ensembl34 to obtain more comprehensive information. CisGenome browser also supports visualization of raw array images and sequence logos of motifs. The memory requirement is minimal. This built-in browser makes it easy and efficient to visualize millions of data points without the need to transfer them over the internet to web-services such as the UCSC genome browser which often becomes inefficient in large-scale analyses.
Through the GUI one can associate binding regions to neighboring genes and study statistical properties of the binding regions in relation to various genome annotation features. For example, one can extract the frequency of regions found in exons, introns, UTRs, etc., and summarize the conservation level of each individual binding region (Supplementary Fig. 2e).
CisGenome contains many functions related to sequence and motif analyses. It can be used to retrieve DNA sequences on binding-regions, map transcription factor binding motifs to the genome, and search for novel motifs35 and cis-regulatory modules36. A de novo motif search may return multiple motifs. CisGenome identifies the functionally relevant ones by comparing the occurrence rates of the motifs in binding regions to those in matching genomic control regions37 (Supplementary Fig. 2e-h and Supplementary Methods).
Currently CisGenome supports human, mouse, Drosophila and Arabidopsis for species-dependent analyses (e.g., peak-gene association). Users can add support for other species (Supplementary Methods).
CisGenome has a modular design so that most of its functions can be accessed in command mode as well as from the GUI. The command mode functions can be conveniently embedded into users' own programs. Interfaces that allow users to link their own programs to CisGenome browser are provided. Interfaces that allow users to plug their own tools into CisGenome GUI are under developing.
Download, FAQ, file formats, tutorial and user manual can be found in http://biogibbs.stanford.edu/~jihk/CisGenome/index.htm. Developing language and operating systems are discussed in Supplementary Methods online. We provide source codes to enable customization by users.
CisGenome can handle data from two types of designs common in ChIP-seq experiments, namely, one-sample analysis where only a ChIP'd sample is sequenced5,9, and two-sample analysis4,6,8,10 where both a ChIP'd sample and a negative control sample are sequenced (see Methods and Fig. 2). In one-sample analysis, CisGenome scans genome with a sliding window and picks up those with read counts bigger than a user-chosen cutoff as binding regions. False discovery rates are estimated by modeling the read count in non-binding windows using a negative binomial distribution. In contrast to the constant rate assumed in the widely used Poisson background model, the negative binomial model allows the background rate of occurrence of the reads to vary across genome and to have a more flexible Gamma distribution. In analyses of many datasets, the negative binomial model had provided much better fit to the data than the Poisson model (Fig. 2b,c). A systematic evaluation of the method is provided in Supplementary Data 1, Supplementary Figure 3-7 and Supplementary Table 1-3 online.
In two-sample analysis, where a negative control sample is also available, CisGenome uses a conditional binomial model to identify regions in which the ChIP reads are significantly enriched comparing to the control reads. Windows passing a user-specified FDR cutoff are used to generate predicted binding regions. Both one- and two-sample analyses use the directionality of reads to refine peak boundaries and filter out low quality predictions. These are provided as two post-processing options, namely, boundary refinement and single strand filtering (Fig. 2d).
To illustrate the basic functions provided by CisGenome, we analyzed whole genome ChIP-chip and ChIP-seq datasets generated for the transcriptional repressor NRSF/REST39,40 in Jurkat cells (see Methods). By going through the steps shown in Supplementary Figure 2, the ChIP-chip analysis identified 7,114 binding regions at a 10% FDR level (median length = 616bp). The NRSF motif was successfully discovered by de novo motif discovery and had the highest enrichment level among all the discovered motifs.
We applied both one- and two-sample analyses to the corresponding ChIP-seq data. One-sample analysis identified 3,312 NRSF binding regions before post-processing (FDR≤10%, median length = 269bp), from which the NRSF motif was recovered by de novo motif discovery (see Supplementary Fig. 8 and Supplementary Table 4 online). Motif mapping results (Table 1) showed that among the initial 3,312 peaks, 1,277 contained ≥1 NRSF motif. Boundary refinement greatly reduced the median length of these 3,312 regions (from 269bp to 60bp) with only a slight decrease of the number of NRSF-site-containing regions (from 1,277 to 1,223). The further step of single strand filtering reduced the number of regions from 3,312 to 1,861 but retained most (1,051 out of 1,223) of the NRSF-site-containing regions. The occurrence rate of NRSF sites in the ChIP-seq regions, even before post-processing, was significantly higher than that in ChIP-chip regions (1.26/kb vs. 0.15/kb). The rate was further increased after each step in the post-processing (to 5.54/kb after boundary refinement, and 6.98/kb after single strand filtering). Such increase of signal-to-noise ratio could potentially increase the chance of finding weak unknown motifs by de novo motif discovery in future studies. Predictions with a higher resolution can also provide more focused targets for future experimental studies, such as those seeking the minimal cis-regulatory elements sufficient and necessary to drive target gene expression.
By using both the ChIP and negative control samples, two-sample analysis identified 3,317 initial binding regions (FDR≤10%, median length = 261bp). Post-processing reduced the median region length to 60~70bp and produced a list of 1,794 high quality regions (Table 1). After post-processing, there is a 96% concordance between the peaks detected in one-sample analysis and those detected in two-sample analysis, i.e., their intersection is 96% of their union (Fig. 3a,b).
Comparisons between array and sequencing technologies showed that peak signals produced by the two platforms had a clear correlation (Fig. 3c,d), although peaks called in the tiling array analysis were generally longer than the corresponding ChIP-seq peaks, and the array peaks were less likely to contain the NRSF motif (Table 1). In all studies, binding regions were more likely to be located near promoters (see Supplementary Table 5 online). They were significantly more conserved than randomly selected genomic regions (Fig. 3e), and they were able to cover 10%-13% of all NRSF motif sites in the genome (Supplementary Table 6 online). Noticeably, 5,517 out of 7,114 (78%) array peaks did not overlap with any ChIP-seq peak (Fig. 3a). To investigate whether these regions represent noise in the tiling array technology or signals missed by ChIP-seq, we performed motif analyses. De novo motif discovery was not able to recover the NRSF motif from the array-specific peaks, and only 1.23% (68/5,517) of the array-specific peaks contained ≥1 NRSF motif. As a comparison, 14.1% (1,001/7,114) of all array peaks, 20.9% (290/1,385) of peaks common to the ChIP-seq analyses but not found by arrays, and 58.8% (933/1,587) of peaks common to all three analyses contained the motif. Analyses using non-canonical NRSF motifs yielded similar results (see Supplementary Data 2, Supplementary Fig. 9 and Supplementary Table 7,8 online). Thus in this example the array-specific peaks are not likely to represent true signals.
One-sample design has been used in many ChIP-seq experiments5,9. It allows more biological contexts to be analyzed within a fixed sequencing budget. To study the merits and limitations of this design, we analyzed ChIP-seq data for two additional transcription factors, Oct4 and Nanog, in embryonic stem cells10. Again, there is good agreement between one-sample and two-sample analyses after post-processing – the concordance is 96% in the case of Oct4 and 83% in the case of Nanog (see Supplementary Data 3 and Supplementary Fig. 10,11 online). These examples suggest that one-sample experiment may sometime provide a cost-effective alternative to the two-sample experiment, perhaps at the expense of some specificity.
To gain a better understanding of limitations of one-sample analysis, we applied it to process negative control samples. A small number of peaks were reported at the 10% FDR level even though no peaks should be expected (Supplementary Table 3). This was caused by the residual background variation that the negative binomial model was not able to explain (Poisson model performed even worse) (Fig. 2b). Systematic evaluation using simulated spike-in data shows that, although the one-sample analysis can provide reasonable FDR estimates when the overall binding signal is strong, the method may underestimate the real FDR significantly when the overall binding in the sample is weak (Supplementary Data 1). Fortunately, poor peak reliability and problematic FDR estimation can often be diagnosed through several criteria, such as highly repeat-rich predictions, predictions covering low percentage of reads, and lack of motif enrichment (Supplementary Data 1). Our current recommendation is to use two-sample experiments whenever it is affordable or when little is known about the transcription factor. When one-sample experiment is used because of cost consideration, negative binomial rather than Poisson background model should be used for excluding background noise, and it is important to evaluate prediction quality using multiple criteria as above. CisGenome is designed to support these various types of analyses.
The basic functionalities of CisGenome can be used in combination to address many different biological questions. For example, de novo discovery from peak regions may yield new sequence motifs. Bench biologists can use the motif mapping and statistical summary functions to systematically evaluate the functional implications of these motifs. As an illustration, we studied a novel motif discovered from a Sox2 and Nanog ChIP-chip dataset on human promoter arrays2. This motif (Fig. 4a) was found by de novo motif discovery in addition to the Oct4 and Sox2 motifs37. It was highly sequence-specific but did not correspond to any known motif stored in TRANSFAC42 (see Supplementary Data 4 online). It would be interesting to know whether the motif is functional. To address this issue, we asked whether the motif sites are phylogenetically conserved, whether they function in clusters, and whether their locations are associated with structural features of genes. We applied CisGenome to answer these questions (see Supplementary Fig. 12 online).
Mapping the motif to the human genome yielded a total of 17,740 motif sites, among which 4,543 (25.6%) were phylogenetically conserved. As a comparison, only 16.3% of the non-repeat base pairs in the genome had the same conservation level (see Supplementary Table 9 online).
When motif sites that were physically clustered together were collected, they were >2 times more conserved than non-clustered sites. Among the 1,674 sites that were separated from another site by ≤500bp, 934 (55.8%) were phylogenetically conserved (vs. 4,543/17,740=25.6% of the general sites being conserved) (Supplementary Table 9).
There were 705 clustered conserved motif sites (defined as two conserved sites separated by ≤500bp). Visual examination shows that, for the majority of these sites, strikingly only sequences within the sites were conserved, and the conservation dropped sharply at the site boundaries (Fig. 4c). Moreover, the most conserved positions coincided well with the most informative positions in the motif. Plotting the mean conservation scores for the flanking positions of the motif clearly verified the observation (Figure 4b).
Summary of physical distributions of the motif sites revealed a strong correlation between the clustered sites and promoters (Table 2). While only 1,920 of all 17,740 sites (10.8%) were located within 1kb upstream of a transcription start site, among the 1,674 clustered sites, 835 (49.9%) were within this region. This percentage increased to 59.6% for the clustered conserved sites (420/705).
Repeating the same analyses on the mouse genome produced essentially the same results (Table 2, Fig. 4 and Supplementary Table 9). Thus the motif is highly likely to be a functional promoter element. The strong evidence here indicates that future investigation of the motif is worthwhile, although the context of the motif's function still awaits further exploration (see Supplementary Data 5 and Supplementary Table 10 online).
Compared to commonly used algorithms including MAT11, TAS13 and Tilescope21 amongst others, CisGenome's internal ChIP-chip peak caller provided competitive or higher sensitivity and specificity when applied to the recently published benchmark spike-in datasets43 (see Supplementary Data 6, Supplementary Fig. 13-14 and Supplementary Table 11 online). For the ChIP-seq analysis, the existing tools GeneTrack29 and CPF4 do not provide statistical estimates of FDR. QuEST30 provides FDR estimates only when the negative control sample is available and when the control has twice as many reads as the ChIP sample. SISSRs31 estimates FDR in the one-sample analysis based on a Poisson model. Compared to these tools, CisGenome not only provides high sensitivity and specificity, but also provides better methods for FDR estimation (see Supplementary Data 7,8 and Supplementary Fig. 15 online). In the one-sample analysis, the negative binomial model provides a better model of background. In the two-sample analysis, the conditional binomial model does not pose special requirements on the number of negative control reads.
As summarized in Supplementary Table 12 online, most peak detection tools do not support both ChIP-chip and ChIP-seq analyses and do not support high-level analyses such as motif discovery and peak-gene association. To perform these analyses, traditionally one has to use other tools such as MEME44 and MDSCAN25 for motif discovery and Galaxy45 for linking peaks to gene annotations. For data visualization, IGB has been developed to visualize Affymetrix tiling array data, and SignalMap is a proprietary tool for processing NimbleGen data. Both are platform-specific and do not handle ChIP-seq data. Genome browsers at UCSC and Ensembl are useful for general purposes but are not optimized for handling ChIP data analyses. They do not provide certain functions particularly useful for ChIP data analyses such as visualization of array images and motif logos which are currently processed by independent tools such as WebLogo46. Furthermore, the need to constantly transfer data over the internet makes large-scale interactive data analyses inefficient. Thus, currently to integrate different types of data and conduct various upstream and downstream analyses, the required tools are distributed in a dozen of programs. A large amount of effort is required to reformat output of one piece of software before feeding it to the other. Although web-services such as CEAS28 try to integrate multiple analysis functions, they usually only perform analyses in a pre-defined manner, and there is limited flexibility to customize the analysis to answer the questions of most interest to the user (e.g., analysis of the novel motif illustrated above). In this context, the development of CisGenome has filled an urgent need for a single user-friendly environment with all the basic functionalities for ChIP-chip and ChIP-seq analyses. We believe the availability of CisGenome will significantly enhance the ability of experimental biologists to extract information from their ChIP datasets and from data provided by large-scale efforts such as the ENCODE47 project.
For the interest of space, we only included in the main text the analyses that directly relate to the illustration of CisGenome. Many issues not covered are nevertheless important. These include (1) what are the likely reasons for the observed differences between the NRSF ChIP-chip and ChIP-seq data, (2) whether these differences represent a general phenomenon, (3) how do they relate to previous comparisons of array and sequencing technologies5,48, and (4) what are the different types of negative controls. Further analyses and discussions of these topics are provided in Supplementary Data 9-13 and Supplementary Figure 16 online.
Data used in this study are summarized in Supplementary Table 1 online. The NRSF ChIP-chip data (GEO accession #: GSE8489) were obtained by analyzing the bound DNA fragments in Jurkat cells with Affymetrix Human Tiling 2.0R arrays. Two independent ChIP samples and two mock IPs were profiled. The NRSF ChIP-seq data were collected from a previous study4. In that study, DNA fragments bound by NRSF in Jurkat cells were sequenced with the next generation sequencer made by Illumina/Solexa. These experiments involved sequencing a ChIP'd sample as well as a negative control sample generated from reverse-crosslinked genomic DNA that had not undergone immunoprecipitation. The Oct4 and Nanog ChIP-seq data were collected from .
Most sequencing platforms will output mapped sequence reads up to a specified number of mismatches and will allow elimination of reads that map to multiple locations. CisGenome can accept the mapped reads as input. CisGenome also accepts mapping output from SeqMap49, a program that allows mapping of sequence reads in more customized ways, such as accounting for insertions and deletions (see Supplementary Methods online).
Genome is divided into non-overlapping windows with length w (typically 100bp). The number of reads ni within each window i is counted. It is assumed that in non-binding regions, ni|λi ~Poisson(λi), and λi~Gamma(α,β). This implies that the background read occurrence rate varies across the genome, and marginally ni ~ Negative binomial(α,β). To estimate α and β, a truncated negative binomial distribution is fitted to the number of windows with small number of reads (≤2 reads). We use this estimated null distribution to compute the FDR for each level of read-counts. In the widely used Poisson model, λi is assumed to be a constant λ0 across the genome rather than a random variable. To estimate λ0, we fit a truncated Poisson using the windows with ≤1 reads. The FDR computation and model fitting details are provided in Supplementary Methods online. The fitting method assumes that most windows with small read-counts represent noise. The assumption usually holds true with sufficient depth of sequencing. For studies in which signals cover a large fraction of the genome (e.g., histone modifications) but the sequencing coverage is not deep enough, the true targets may be covered by only 1 or 2 reads in a short window. When this is the case, our model fitting approach may either be applicable after increasing the window size or may not be applicable depending on how long a typical peak extends.
In a specific location, the counts of the reads from the ChIP sample are subjected to biases that may arise during sample preparation, amplification or sequencing procedures. To correct for these biases, one can generate sequence reads from negative control samples in the same experiments. Supplementary Figure 5, 17 and Supplementary Table 13 online show that the read sampling rates from the ChIP and control samples at the same genomic loci are correlated. Therefore, false signals due to unknown systematic bias can be eliminated by excluding regions if both the ChIP and the negative control samples show strong signals but the former is not significantly stronger than the latter. When reads are also available from a negative control sample, we divide the genome into non-overlapping windows with length w. For each window i, the number of reads in the ChIP sample k1i, the number of reads in the control sample k2i and the total read number ni= k1i+k2i are counted. We assume that when there is no IP enrichment in the window, the conditional distribution of the count in the ChIP sample (k1i) given the total count (ni) follows a binomial (ni, p0) distribution. We estimate p0 based on windows with small total counts and use it to estimate the FDR associated with each level of ni and k1i/ni (see Supplementary Methods online).
We scan the genome with a sliding window of width w to detect all windows with FDR smaller than a user-chosen cutoff. Detected windows that overlap with each other are merged into one region. If a region contains more than one overlapping window, the minimal FDR among the overlapping windows is taken as the FDR of the region. In the two-sample analysis, for each sliding window i we also compute a fold enrichment ([yi+1]/[r0*zi+1]) where yi is the number of ChIP reads in the window, zi is the number of control reads in the window, and r0=p0/(1-p0). One is added to both the numerator and denominator to avoid dividing by zero. The biggest fold change among all the overlapping windows within a binding region is recorded as the fold change of the region.
CisGenome uses the counts of 5′ reads and 3′ reads within each candidate binding region to further pinpoint the location of transcription factor binding site within the region (Fig. 2d), and to filter out regions enriched for reads of only one direction based on the assumption that these are unlikely to represent real binding events. Regions that are retained after the boundary refinement and single strand filtering are defined as high quality binding regions (see Supplementary Methods online).
CisGenome uses a two-pass algorithm for peak detection. High quality peaks detected in the first pass will be used to estimate the DNA fragment length, which is computed as the median distance between the modes of the coupling 5′ and 3′ peaks. In the second pass, the reads are shifted towards the center of the ChIP'd fragments by half of the estimated fragment length, and FDR computation and peak detection will be run again on the shifted reads to get the final predictions.
The default choice of window size w=100bp represents a tradeoff between sensitivity and specificity based on the analysis of the NRSF data (see Supplementary Table 14, 15 online). With a smaller w, one can get sharper boundaries of binding regions. However, more noise will be introduced and fewer regions containing the NRSF motif will pass the significance cutoff (FDR≤10%). A bigger w on the other hand may dilute the signals, resulting in a lower resolution of binding region call and a lower percentage of regions that contain the NRSF motif. In future transcription factor studies, one can fine tune the choice of window size w in a similar fashion by using either the known transcription factor binding motifs or motifs recovered from the de novo motif discovery.
To characterize the conservation level of binding regions, CisGenome allows users to first choose a t such that x percent of the whole genome has a phastCons41 score ≥t. For each peak, positions with phastCons score ≥t are picked up, and the average phastCons score for these positions is computed to serve as the peak's conservation level. If a peak has no position with phastCons score ≥t, its conservation level is zero. A high cutoff t (or a small x) will help users focus on the most conserved part of each binding region. To generate Figure 3e, the default value x=10 was used. Peak conservation levels within a tier were averaged. In CisGenome, phastCons score was transformed linearly from [0, 1] to [0, 255] so that each computer byte can store the score for a single genomic position.
This research was supported by NIH grant HG003903 (WHW), and by NHGRI's ENCODE project (RMM). HKJ is partially supported by the JHSPH Richard L. Gelb Cancer Research Fund. The authors would like to thank Dr. Wei Li for providing assistance to analyze the ChIP-chip spike-in data.