Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is currently the method of choice for genome-wide mapping of binding sites for transcription factors (TFs) on DNA [
1-
3]. This is achieved by using DNA fragmentation after DNA-binding proteins have been cross-linked to the genome. The fragmentation is followed by selection of fragments bound by the TF of interest, using an antibody targeting this factor. The selected genomic fragments are then released, sequenced, and mapped to a reference genome. There, the genomic locations bound by the TF will be enriched with matching sequencing reads.
An essential step in the analysis of ChIP-Seq data is the genome-wide identification of enriched regions. Several software tools have been developed to perform this task [
4,
5], but benchmarking has demonstrated that there is room for improvement in the existing approaches [
6]. There are alternative definitions of what constitutes a peak, and useful characteristics of ChIP-Seq profiles may not be fully utilized. A typical example is the shift property [
7], which occurs because the full sequence fragments, typically with an average length around 200 bp, are sequenced for only 25–50 bp from each side. Other examples are the use of independent control samples [
8], separation of overlapping enrichment profiles [
9,
10], or optimal use of statistical approaches to separate signal from noise [
11]. Our conclusion in a previous study [
6] was that different programs focus on different characteristics of the ChIP-Seq data, but that no program takes all the characteristics into account. Though several programs achieve a high coverage of true enrichment profiles, the trade-off has often been an intolerably high false discovery rate (FDR).
A major limitation in the development of improved approaches has been the lack of proper benchmarks to compare the performance of different methods [
12]. Because of this, different strategies for performance evaluation have been used, the most common being motif occurrences in the proximity of the ChIP-Seq peaks [
4,
5,
8,
13], and overlap with experimentally confirmed qPCR sites [
4,
5]. However, both evaluation methods have important limitations [
6,
12], and when looking at a limited number of binding regions, the preferred evaluation method remains visual assessment of local enrichment profiles in a genome browser. To compensate for the lack of benchmarks in ChIP-Seq analysis, we have previously manually curated a benchmark data set based on visual inspection of ChIP-Seq profiles from three different TFs: NRSF (also known as REST), SRF, and MAX [
6]. Individual regions were visually assessed relative to relevant criteria, including peak-like shape, peak shift, lack of signal in control sample, and motif occurrence in peak region, and the regions were classified either as real peaks or as noise or artifacts. The idea was that such a benchmark can be used to evaluate and improve both new and existing efforts for the automatic analysis of ChIP-Seq data for TFs.
In this study, we present an improved approach for automatic identification of peaks in ChIP-Seq enrichment profiles, called the Triform method. Triform uses robust genome-wide statistical tests to detect three different forms of peak-like enrichment profiles, taking advantage of the profile characteristics mentioned above. Overfitting is precluded by the fact that Triform is based on model-free statistical tests and uses a minimal number of preset parameters based on the general properties of the ChIP-Seq data sets. The Triform algorithm is model free in the sense that it relies on the Hoel test [
14] which is fully nonparametric, i.e. free from any assumed relationships or fitted parameters. In particular, the Hoel test is free from any assumed background model and is therefore more robust than model-based tests, which depend on locally uniform background models and fitted background parameters. When evaluated on the benchmark data set of peak profiles, Triform outperformed both newly developed and previously evaluated programs for the automatic detection of enrichment profiles, for all three TFs studied. The good performance of Triform was further confirmed using tests on motif enrichment in significant peak regions.
Since TFs often co-regulate genes that are involved in specific biological processes, we would expect such processes to be overrepresented in the annotation of genes associated with regions for TF binding from the ChIP-Seq experiment. We therefore used statistical overrepresentation analysis on peak sets from the main peak-finding programs that are compared here, and showed that peak sets from Triform in most cases give the most significant overrepresentation of relevant annotation terms.
To illustrate the potential of improved peak finding to generate novel information, we further analyzed the DNA sequence motifs identified de novo within the Triform SRF peak regions. In addition to the canonical SRF motif, we discovered a significant co-occurrence of SRF, ELK1, and NFY motifs within LTR/ERV1/MER57 repeats, and these particular repeats were significantly co-located with genes associated with cancer. This exemplifies how optimal identification of peak regions may generate novel information.