The vendor-supplied analysis package includes an image analysis tool that transforms pixel values into intensities and a base-calling tool that converts the intensities into sequences. Because the output sequences are short and have high error rates, third-party base-calling tools have been developed to increase base-call accuracy and yield [14]. Alternative tools correct potential errors after base-calling [15] and facilitate genome alignment and de novo assembly [16]. However, the potential benefit to peak calling for ChIP-Seq data remains unexplored. Short sequence reads, with or without error correction, are then mapped to a reference genome using a variety of alignment programs. Owing to the recent development of alignment tools, short read alignment is no longer a bottleneck in the data analysis process; for a comprehensive summary of alignment tools, we refer readers to a recent review article [17].
The quality of ChIP-Seq data can be inspected using a combination of methods. First, it is important to evaluate the summary report generated by the vendor-supplied analysis pipeline. For example, the “Summary.html” file from CASAVA contains a comprehensive set of performance measures for data generated on Illumina GA platforms. The next step involves converting the sequence alignments to an appropriate format, uploading them to a genome browser, and examining several genomic regions of interest (e.g., known targets of a transcription factor). Another qualitative measure of ChIP-Seq data quality involves searching for sequence motifs within tag-enriched regions or peaks [18]. In addition, it may be useful to examine the distribution of tag profiles around certain genomic features (e.g., transcriptional start sites). We also suggest running the same inspection in parallel on input or IgG controls, and minimizing bias using dedicated tools [19]. As mentioned previously, it is also important to validate selected ChIP-Seq peaks using quantitative PCR.
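The tag-profile inspection around transcriptional start sites can be sketched in a few lines of Python. This is a minimal illustration rather than a published tool: the function name `tss_profile`, the flank and bin sizes, and the assumption that all positions lie on a single chromosome are ours.

```python
def tss_profile(tag_positions, tss_sites, flank=1000, bin_size=100):
    """Average tag count per TSS in fixed-width bins spanning [-flank, +flank).

    tag_positions: iterable of integer tag coordinates (one chromosome assumed).
    tss_sites: list of (position, strand) tuples, strand being "+" or "-".
    """
    n_bins = (2 * flank) // bin_size
    counts = [0] * n_bins
    for tss, strand in tss_sites:
        for pos in tag_positions:
            # Offset relative to the TSS, mirrored for minus-strand genes
            # so that negative offsets are always upstream.
            offset = (pos - tss) if strand == "+" else (tss - pos)
            if -flank <= offset < flank:
                counts[(offset + flank) // bin_size] += 1
    # Normalize by the number of TSSs to obtain an average profile.
    return [c / len(tss_sites) for c in counts]
```

Computing the same profile for the ChIP sample and for the input or IgG control, then plotting both, gives a quick visual check of whether enrichment around the feature is specific to the ChIP. The nested loop is quadratic and is meant only for small test sets.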
After the initial quality inspection, peak calling is performed to identify tag-enriched regions from the ChIP-Seq data. Multiple algorithms are available, and their comparison is the subject of several publications [18]. When considering which tool to choose, it is important to recognize that there are two fundamental types of peaks, sharp and broad. Inspecting tag distributions in a genome browser, together with prior knowledge, helps to form an initial idea of what the peaks look like. Algorithms like MACS [20] work well for identifying the sharp peaks of most sequence-specific transcription factors, while programs like SICER [21] and CCAT [22] are appropriate for identifying the broad peaks of most histone modifications and chromatin binding proteins. In our experience, CCAT has greater sensitivity for identifying peaks, while SICER has greater specificity. However, because CCAT requires negative controls to estimate noise rates, it may not be applicable to datasets that lack negative controls, such as FAIRE-Seq data. Another method, ZINBA, has recently been developed to identify both sharp and broad peaks [23]. Because these tools are designed for different purposes, a direct performance comparison between them may not be fair. Moreover, efforts to evaluate these methods have been limited, largely due to the absence of objective benchmark standards [18].
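To make the idea of a "tag-enriched region" concrete, the following sketch tests fixed-width windows against a Poisson background. It is a generic illustration and does not reproduce the algorithm of MACS, SICER, CCAT, or ZINBA; the function names, the window representation (a list of tag counts per window), and the cutoff are assumptions. When a control is supplied, the expected rate is taken as the larger of the genome-wide average and the depth-scaled control count; without a control, only the genome-wide average is used.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term            # accumulate P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

def call_enriched_windows(chip_counts, control_counts=None, p_cutoff=1e-5):
    """Return (window_index, p_value) for windows enriched in the ChIP sample."""
    genome_rate = sum(chip_counts) / len(chip_counts)
    scale = 0.0
    if control_counts is not None:
        # Scale the control to the ChIP sequencing depth (control must be non-empty).
        scale = sum(chip_counts) / sum(control_counts)
    hits = []
    for i, chip in enumerate(chip_counts):
        lam = genome_rate
        if control_counts is not None:
            lam = max(lam, control_counts[i] * scale)
        p = poisson_sf(chip, lam)
        if p < p_cutoff:
            hits.append((i, p))
    return hits
```

A fixed window width suits sharp peaks better than broad domains, which is one reason dedicated broad-peak callers exist.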
Reads mapped to multiple sites (multi-reads) are discarded during "normal" analysis. Consequently, peaks in highly repetitive regions are overlooked. However, repetitive regions have been linked to important biological functions such as disease susceptibility, immunity and defense. A new method has recently been proposed to incorporate multi-reads into peak detection using a weighted alignment scheme [24]. Since most of the novel peaks reside in repetitive regions, this method will be of particular interest for the analysis of ChIP-Seq data from proteins that selectively bind to repetitive regions.
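One simple way to weight multi-reads, shown here only as an illustration and not as the scheme of the cited method, is to split each multi-read across its candidate locations in proportion to the number of uniquely mapped reads already observed there. The function name `weight_multireads`, the bin labels, and the pseudocount are hypothetical.

```python
def weight_multireads(unique_counts, multireads, pseudocount=1.0):
    """Distribute each multi-read fractionally across its candidate bins.

    unique_counts: dict mapping bin label -> number of uniquely mapped reads.
    multireads: list of lists, each holding the candidate bins of one read.
    Returns a dict of weighted counts (unique reads plus fractional multi-reads).
    """
    weighted = {b: float(c) for b, c in unique_counts.items()}
    for candidates in multireads:
        # Support for each candidate comes from unique reads only, so the
        # result does not depend on the order in which multi-reads arrive.
        support = [unique_counts.get(b, 0) + pseudocount for b in candidates]
        total = sum(support)
        for b, s in zip(candidates, support):
            weighted[b] = weighted.get(b, 0.0) + s / total
    return weighted
```

Each multi-read contributes a total weight of one, so the overall tag count is preserved while repetitive regions are no longer dropped outright.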
Another important issue in data analysis is how to compare the levels of histone modifications or transcription factor binding between two different cell types or under different conditions. Due to variations in ChIP conditions, the level of noise may vary significantly between different samples even for the same antibody. Because scaling the data to sequence depth does not eliminate systematic errors, normalization algorithms are needed to enable comparisons across samples. A recent tool, DIME, has been developed to classify significantly enriched regions between two ChIP-Seq samples based on an estimation of multivariate mixture models [25]. Because the method partitions the genome into bins that are larger in size than a typical TF binding site, it may serve as a first-pass algorithm to identify candidate differential binding regions, especially in cases of low sequencing depth. A sub-module of SICER is also able to identify differentially enriched regions between two conditions when regions of interest are specified (e.g., changes in the levels of histone modifications at promoters between two samples).
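The limitation of depth scaling can be seen in a small sketch. The code below scales binned tag counts to a common depth and computes a per-bin log2 fold change; it is not the DIME or SICER procedure, and the function names and pseudocount are assumptions made for illustration.

```python
import math

def scale_to_depth(bin_counts, per=1_000_000):
    """Rescale binned tag counts so that they sum to `per` (tags per million)."""
    total = sum(bin_counts)
    return [c * per / total for c in bin_counts]

def log2_fold_change(counts_a, counts_b, pseudocount=1.0):
    """Per-bin log2 ratio of sample A over sample B after depth scaling."""
    a = scale_to_depth(counts_a)
    b = scale_to_depth(counts_b)
    return [math.log2((x + pseudocount) / (y + pseudocount)) for x, y in zip(a, b)]
```

Two samples that differ only in depth, such as `[10, 10, 80]` and `[20, 20, 160]`, give a fold change of zero in every bin. If one sample instead has a higher noise level, for example `[30, 30, 40]` against `[10, 10, 80]`, the background bins show an apparent increase and the peak bin an apparent decrease even though binding may be unchanged, which is why normalization beyond depth scaling is needed.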
Because genome-wide data are being generated at a rapid rate, it is important that analytical tools be developed at a similar pace to support the storage and analysis needs that users encounter. We expect a flurry of software with user-friendly graphical interfaces (GUIs) to be released in the near future and adopted by biologists with diverse backgrounds.