The advent of next-generation sequencing technologies has propelled the development of new techniques, among which ChIP-Seq has become an important method for genome-wide discovery of binding sites for DNA-associated proteins, in particular transcription factor binding sites (TFBSs). ChIP-Seq consists of the immunoprecipitation of protein–DNA complexes followed by massively parallel sequencing of the short ends of the immunoprecipitated DNA fragments (1–3). This technique succeeded the ChIP-on-chip technique (4) and has nearly replaced it because of its increased accuracy in identifying TFBSs (2).
At the completion of a ChIP-Seq experiment, millions of short (~35–50 bp) directional DNA tags are obtained, which can be aligned to the reference genome of the sample organism (Supplementary Figure S1). Each short tag represents one end of a longer DNA fragment (~200–400 bp, depending on the experiment) isolated by immunoprecipitation. Thus, when analyzing the short tags, it is important to take this experimental fact into account and to reconstruct the full extent of the original fragment that gave rise to each tag. By extending each tag, it is then possible to identify areas of overlap, which represent the location of the protein binding event. The density profile of DNA fragment coverage can then be calculated and 'peaks' corresponding to putative binding sites can be extracted. This idea was elegantly implemented in the FindPeaks software (5). However, the accuracy of peak calling can be considerably improved by incorporating information about the genomic sequences of peaks in addition to coverage depth information.
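The tag-extension idea described above can be illustrated with a minimal Python sketch. It is not the FindPeaks or MICSA implementation; the names `Tag`, `extend_tag`, `coverage_profile` and `call_peaks`, the fixed `fragment_len`, and the simple height cutoff `min_height` are assumptions made here for illustration only.

```python
from collections import namedtuple

# Hypothetical tag record: 0-based position of the tag's 5' end and its strand.
Tag = namedtuple("Tag", ["start", "strand"])


def extend_tag(tag, fragment_len):
    """Return the half-open interval [lo, hi) of the inferred DNA fragment."""
    if tag.strand == "+":
        lo, hi = tag.start, tag.start + fragment_len
    else:
        # On the reverse strand the 5' end is the rightmost base of the fragment.
        lo, hi = tag.start - fragment_len + 1, tag.start + 1
    return max(lo, 0), hi


def coverage_profile(tags, fragment_len, chrom_len):
    """Count, for every position, how many extended fragments cover it."""
    diff = [0] * (chrom_len + 1)  # difference array: +1 at start, -1 at end
    for tag in tags:
        lo, hi = extend_tag(tag, fragment_len)
        hi = min(hi, chrom_len)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    cov, running = [], 0
    for d in diff[:chrom_len]:
        running += d
        cov.append(running)
    return cov


def call_peaks(cov, min_height):
    """Return (start, end, max_height) for each maximal run with coverage >= min_height."""
    peaks, start = [], None
    for i, c in enumerate(cov):
        if c >= min_height and start is None:
            start = i
        elif c < min_height and start is not None:
            peaks.append((start, i, max(cov[start:i])))
            start = None
    if start is not None:
        peaks.append((start, len(cov), max(cov[start:])))
    return peaks


tags = [Tag(10, "+"), Tag(12, "+"), Tag(25, "-")]
cov = coverage_profile(tags, fragment_len=10, chrom_len=40)
print(call_peaks(cov, min_height=2))  # [(12, 22, 3)]
```

In this toy example, the three extended fragments overlap between positions 16 and 19, which is where the coverage reaches its maximum of 3. In practice the fragment length would be estimated from the experiment rather than fixed.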
In this article, we present an algorithm implemented in the MICSA software (Motif Identification for ChIP-Seq Analysis) that is based on the idea that functional binding sites of transcription factors (TFs) should contain a consensus motif (or a set of motifs). Consensus motifs are the composite DNA sequences for which a DNA-binding protein, such as a TF or a restriction enzyme, has a high affinity. Such motifs can be identified from the small subset of peaks with high DNA fragment coverage.
The MICSA algorithm is innovative in the context of ChIP-Seq data analysis in that it simultaneously performs (i) de novo TFBS motif identification and (ii) functional binding site prediction using information about motif occurrences in peaks along with coverage depth information. Here, motif identification is not a post-processing step, as in other ChIP-Seq analysis pipelines (6), but a key element that makes it possible to keep even low peaks if they contain a strong motif occurrence.
Since MICSA checks for motif occurrences in all peaks, including those with very low coverage depth, there is no need to explicitly select a threshold on DNA tag/fragment coverage. The only parameter that remains to be specified is the maximal number of expected false positive hits among the selected peaks, or equivalently the maximal false discovery rate (FDR).
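To illustrate how a maximal FDR can act as the only user-supplied parameter, the following Python sketch selects a score cutoff from an empirical FDR estimate. This is a generic technique and not necessarily the procedure used in MICSA: the function `threshold_for_fdr`, the notion of a per-peak `score`, and the availability of scores from a control sample are all assumptions introduced here.

```python
def threshold_for_fdr(sample_scores, control_scores, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr, or None.

    The FDR at a cutoff is estimated as
    (number of control peaks >= cutoff) / (number of sample peaks >= cutoff).
    Both score lists are hypothetical inputs for this illustration.
    """
    control_sorted = sorted(control_scores, reverse=True)
    best = None
    n_control = 0
    ranked = sorted(sample_scores, reverse=True)
    for n_sample, cutoff in enumerate(ranked, start=1):
        # Advance through the control peaks that score at least as high as the cutoff.
        while n_control < len(control_sorted) and control_sorted[n_control] >= cutoff:
            n_control += 1
        if n_control / n_sample <= max_fdr:
            best = cutoff
    return best


sample = [50, 40, 30, 20, 10]
control = [25, 12, 5]
print(threshold_for_fdr(sample, control, max_fdr=0.25))  # 20
```

With these toy values, a cutoff of 20 keeps four sample peaks against one control peak (estimated FDR of 0.25), whereas lowering the cutoff to 10 would raise the estimate to 0.4.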
Using the procedure developed by Kharchenko et al. (7), we compared the peak identification performance of MICSA and 10 other published tools (5–14). The dataset selected for the comparison was generated by Johnson et al. (2) for the neuron-restrictive silencer factor (NRSF). MICSA showed a considerable increase in performance over the 10 other approaches. To strengthen the statistical basis of this comparison, we applied the same procedure to selected algorithms on other ChIP-Seq datasets, including those for GA-binding protein (GABP) (10), signal transducer and activator of transcription 1 (STAT1) (9) and CCCTC-binding factor (CTCF) [ENCODE project, the Broad Institute and the Bradley E. Bernstein lab at the Massachusetts General Hospital/Harvard Medical School (15)]. The results of the comparison indicated that using MICSA for ChIP-Seq data analysis significantly reduces the number of false positive TFBS predictions.
The MICSA package was also applied to our own ChIP-Seq data (16). Immunoprecipitation was performed with a specific antibody directed against the oncogenic TF EWS-FLI1 (17) to obtain biological insight into the functioning of this TF, which is known to be the major oncogene in Ewing sarcoma. Using our technique, based on motif identification, we confirmed the existence of two consensus motifs, one representing a (GGAA)n microsatellite and the second containing the RCAGGAARY consensus sequence (16) (R = A/G, Y = T/C). Further analysis of the EWS-FLI1 data, together with expression arrays, suggested that EWS-FLI1 bound to (GGAA)n microsatellites can activate transcription of neighboring genes, whereas EWS-FLI1 bound to RCAGGAARY sites may, depending on the gene, activate or repress transcription. Our analysis confirmed five known direct target genes of EWS-FLI1 and also predicted many new genes that are putatively regulated directly by EWS-FLI1.
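As an illustration of what scanning a peak sequence for these two motifs involves, the following Python sketch matches the RCAGGAARY consensus on both strands and locates (GGAA)n runs on the forward strand. It is not the MICSA motif search: the function names and the `min_repeats` cutoff are assumptions made here, since the text does not specify a minimal value of n.

```python
import re

# IUPAC codes needed for the consensus quoted in the text (R = A/G, Y = T/C).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYN", "TGCAYRN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including the IUPAC codes R, Y and N."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, consensus):
    """Return sorted (start, strand) pairs for consensus matches on either strand."""
    seq = seq.upper()
    hits = []
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        # A lookahead lets overlapping occurrences be reported as well.
        pattern = "(?=" + "".join(IUPAC[b] for b in motif) + ")"
        hits.extend((m.start(), strand) for m in re.finditer(pattern, seq))
    return sorted(hits)


def find_ggaa_repeats(seq, min_repeats=4):
    """Return (start, end, n_repeats) for forward-strand runs of (GGAA)n.

    min_repeats is an illustrative cutoff, not a value taken from the text.
    """
    pattern = re.compile("(?:GGAA){%d,}" % min_repeats)
    return [(m.start(), m.end(), (m.end() - m.start()) // 4)
            for m in pattern.finditer(seq.upper())]


print(find_sites("TTACAGGAAGTCC", "RCAGGAARY"))       # [(2, '+')]
print(find_sites("GGACTTCCTGTGG", "RCAGGAARY"))       # [(2, '-')]
print(find_ggaa_repeats("ACGGAAGGAAGGAAGGAAGGAATT"))  # [(2, 22, 5)]
```

Exact consensus matching is the simplest possible criterion; a position weight matrix with a score threshold would tolerate mismatches, at the cost of an additional parameter.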
The algorithm we developed is pioneering in its use of motif information when predicting sites of specific TF binding from ChIP-Seq data. It allows the identification of several motifs which, as shown by the EWS-FLI1 example, may carry different biological functions.