ZINBA overview
ZINBA performs three steps: data preprocessing, determination of significantly enriched regions, and an optional boundary refinement for more narrow sites (Figure ). The first step involves tabulating the number of reads falling into contiguous non-overlapping windows (default 250 bp) tiled across each chromosome and scoring corresponding covariate information. Covariates can consist of any quantity that may co-vary with signal in a given region, including, for example, G/C content, a smoothed average of local background, read counts for an input control sample, or the proportion of mappable [
28] bases, which we define as the mappability score (Materials and methods). Optionally, additional sets of contiguous windows with offset starting positions can be tabulated for increased resolution. Each set of offset windows is analyzed independently in the next step.
In the second step, a novel mixture regression model is used to probabilistically classify each window into one of three components: background, enrichment, or zero-inflated. In this context, and throughout the manuscript, the term 'enrichment' will refer to genomic DNA sequences that were captured specifically as the result of the biological experiment under consideration. The term 'background' includes genomic DNA sequences that appear due to experimental noise, noise that arises in the sequencing process, or noise that arises in the computational processing of the data. The term 'zero-inflated' refers to those genomic locations at which we might expect coverage by a sequencing read derived from either the background or enrichment signal components, but that are not represented in the real data. Zero-inflation typically occurs due to a lack of sequencing depth and is common in many NGS datasets. Regions containing higher proportions of non-mappable bases are also more likely to be zero-inflated, as it is more difficult to assign reads to these regions during the mapping process.
ZINBA utilizes an iterative approach [
29] to determine for each window the relative likelihood of belonging to each component, in addition to estimating the relationship between average signal in each component and a set of covariates (Materials and methods). Each iteration consists of two steps. In the first step, a set of posterior probabilities of component membership is computed for each window, based on how well each window fits with the average signal level in each component, adjusted for covariate effects. In the next step, the average signal level in each component is modeled separately with its own formulation of covariates using weighted generalized linear models (GLMs). The posterior probabilities of component membership are used as regression weights and serve to partition the genome into likely background, enrichment, and zero-inflated regions to determine component signal. The model iterates between these two steps until the classification and component-specific covariate estimates cease to change.
Adjusting for covariate effects is often beneficial or necessary for dissecting enrichment regions and background. For example, although signal in background regions is typically lower than in regions of enrichment, background regions in copy-number amplified regions may have higher signal than enrichment regions that occur in locations with a normal DNA copy number. Thus, adjusting for copy number changes is necessary for correct separation of background and enrichment regions. The set of covariates used to model each component can be selected based on either prior knowledge or an information criterion, such as the Bayesian information criterion (BIC). Covariates with no or weak relationships with mean signal in a component will have little effect on classification, but do contribute to model complexity. The BIC criterion helps to remove such covariates to balance model fit and model size.
In the third step, all overlapping or adjacent windows classified as enriched are merged. For the detection of broader elements, especially helpful for histone modifications demarcating broad genomic regions (such as H3K36me3), an additional 'broad' setting is available that merges enriched windows within a fixed distance. An optional shape-detection algorithm may then be applied to identify sharp enrichment signals within broader enriched regions.
Modeling signal components with relevant covariates improves enrichment detection
To evaluate the utility of incorporating covariate information for the detection of enriched regions, we constructed simulated datasets, and used G/C content as one example of such a covariate. Simulated datasets were constructed to artificially control the relationship between G/C content and the enrichment, background, and zero-inflated components. Window count data were simulated to represent three types of common NGS signal patterns, ranging from TFBSs (high signal-to-noise ratio, 1% of genome belongs to enrichment component), FAIRE (moderate signal-to-noise ratio, 5% of genome belongs to enrichment component), to some histone modifications (low signal-to-noise ratio, 10% of genome belongs to enrichment component). For each data type, three sets of data were simulated, hence nine datasets in total. In each data set, G/C content always had a positive relationship with signal in the background component and a positive relationship with the probability of being zero-inflated. However, G/C content was simulated to have either a positive, neutral or negative relationship with enrichment. For each of the nine datasets, 100,000 windows were simulated. These consisted of 250-bp windows from human chromosome 22 (Materials and methods). G/C content was simulated from these windows as well.
Now, for each of the nine simulated datasets, three different uses of the covariate were employed to model the simulated data: (a) model 1, no covariates; (b) model 2, G/C content is incorporated in modeling the zero-inflated and background components only; (c) model 3, G/C content is incorporated in modeling all three components.
Our results show that models that properly accounted for the underlying simulated relationships with G/C content in each component resulted in the best classification outcomes. For example, when enrichment had an inverse relationship with G/C content (Figure ), model 3 consistently led to higher sensitivity and specificity relative to models 1 and 2 (Figure ). Simulated component-specific relationships between G/C content and signal were also correctly captured in model 3 (Figure ), with average enrichment signal decreasing and average background signal increasing with respect to G/C content. Ignoring the role of G/C content completely (model 1) resulted in classification based purely on signal, which misses informative trends in the data (Figure S1 in Additional file
1). We find similar results for the simulated condition of positive and neutral relationships between G/C content and enrichment (Figures S2 and S3 in Additional file
1). Thus, including relevant covariates to model each component provides a more informed assessment of enrichment versus background.
These results also serve to illuminate how ZINBA distinguishes the separate roles of component-specific covariates. For example, covariates that are relevant to the background component explain variability in background signal that may otherwise be confused for enrichment. This benefit of ZINBA is more apparent when the signal-to-noise ratio is low (Figure ) because, in that case, many background and enrichment windows contain similar numbers of reads, and the two states are difficult to distinguish by signal alone. In the situation where we simulated a neutral relationship of G/C content with enrichment, model 3 had similar performance to model 2, suggesting that the use of G/C content to model the enrichment component did not degrade classification performance. Rather, the estimated effect of G/C content in the enrichment component was close to zero, and thus had little effect on classification (Figure S2 in Additional file
1) at the cost of greater model complexity.
While we chose to simulate our data in this section with respect to only one covariate, the regression basis for the mixture model allows the inclusion of multiple covariates simultaneously, as is inherent in any regression-based framework. Regardless of whether the data consist of rare, high signal-to-noise enrichment or common, low signal-to-noise enrichment, the model performs better when each component is modeled with relevant sets of covariates. However, the performance gain when using relevant covariates is greatest in lower signal-to-noise data.
Automated model selection
Relevant covariates are not always known
a priori. To discover the appropriate formulation of covariates for each component, ZINBA employs the BIC [
30] to select the best model among all possible models, given a set of starting covariates (Materials and methods). BIC balances model fit and model complexity and has long been employed as a statistical assessment of model performance. The regression framework inherent in ZINBA also allows for the modeling of interactions between covariates. Therefore, all pair-wise and three-way interactions between the starting covariates for each component are considered in the model selection procedure. The automated model selection procedure was able to select the most appropriate model for all nine simulated conditions from the previous section.
ZINBA detects relationships between covariates and component signal that vary by experiment
Evaluation of the relationships between the set of component-specific covariates selected using the automated model selection procedure and the datasets shown in Figure [
31,
32] revealed that our mappability score and input control were positively related with mean background signal in each ChIP-seq dataset, which is consistent with previous reports [
5,
28]. Each dataset exhibits distinctly different degrees of signal-to-noise ratio, length of enriched regions, and total proportion of the genome enriched. These differences can be attributed to both functional differences related to biological activity and technical aspects of the different assays. However, the relationship between G/C content and background signal was not consistent between different DNA-seq experiments (Table S1 in Additional file
1), nor were they consistent between components of the same dataset.
For the RNA Pol II and CTCF data, model estimates reveal that G/C content had a positive relationship in background regions, similar to previous reports on G/C content bias [
24-
26] (Figure ). However, in FAIRE-seq data, G/C content was negatively associated with the background component (Figure ). These differences can easily be observed from scatter plots of the raw read counts from windows classified as background versus the corresponding G/C content for the RNA Pol II ChIP-seq and FAIRE-seq datasets (Figure ). The exact cause of the differences in the relationship between G/C content and background signal between datasets, and whether it could be technical or biological, is not known.
The relationship for each covariate also differed in magnitude and direction across components of the same dataset. For example, in FAIRE-seq data, while there was a negative relationship with G/C content in background regions, there was a positive relationship in enriched regions (Table S1 in Additional file
1). A similar difference between the relationship of G/C content in the background and enrichment regions was found for the RNA Pol II ChIP-seq data. Thus, the relationships of covariates with background signal may not be consistent across different data types, and may differ in their relationships to signal in background and enrichment regions of the same data type.
An input control may be used to account for the relationships of G/C content and mappability with background signal. However, the model estimates suggest that input data alone may not explain all of the variability in DNA-seq background. Examination of the relationships of covariates with input signal and DNA-seq background reveals differences in the effects of covariates within each (Figure S4 in Additional file
1). In the case of RNA Pol II (Figure S4a, b in Additional file
1) and CTCF (Figure S4c, d in Additional file
1), where the estimated relationship of G/C content with background DNA-seq signal is positive, in the matching input control sample the relationship with G/C content is relatively neutral. The reason for these differences is currently unknown, but may be related to sample handling differences between the ChIP and input samples.
Incorporation of a covariate for copy number allows peak calling within amplified genomic regions
One challenge for the analysis of DNA-seq data is fluctuations in background signal resulting from copy number variations (CNVs). If not properly accounted for, such changes in background can result in significant false positives. This is especially true if there are no input control samples for comparison, or if the input control samples are insufficiently sequenced. To account for this, we constructed a new covariate to measure local background, and included this covariate in our mixture regression framework to account for local copy number changes. Changes in background signal levels due to CNVs were estimated locally using the DNA-seq sample itself, supplemented by a change-point detection method to determine boundaries of likely CNVs (Materials and methods). Application of this approach provided an accurate estimation of signal changes due to local CNVs in a FAIRE-seq MCF-7 dataset, which is aneuploid and has extensive CNVs [
33] (Figure ).
Using a BIC-selected model considering the local background estimate, G/C content, and mappability score as starting covariates, we found ZINBA was able to correctly classify background regions within CNVs (Figure ) and called 8 and 11 times fewer peaks (1,258) using a FAIRE-seq dataset in MCF-7 CNV regions in chromosome 20 [
34] relative to MACS [
5] and F-seq [
35] (Figure ). Incorporation of this covariate also leads to the better recovery of relevant peak regions within ENCODE [
36] datasets, as we demonstrate in later sections.
Estimation of local background from the experimental data is only effective when local background is sampled from a sufficiently large window size, where these large windows (default 100 kb) will not be dominated by enriched signal. This is the case with the majority of data types, as most contain enriched features that span no more than several kilobases. In any case, the flexibility of ZINBA allows for CNV estimates from any source to be included into the model selection procedure and determination of enrichment. ZINBA also includes a 'CNV mode', which can be run on input DNA for a quick estimation of the extent of amplified genomic regions in a given sample. This mode utilizes 10-kb windows in the ZINBA mixture model without any covariates, aiming to detect extended region enrichment of input reads.
Evaluation of ZINBA over a wide range of signal patterns and amplitudes
We selected a variety of DNA-seq datasets, including FAIRE-seq, CTCF, RNA Pol II, and H3K36me3 ChIP-seq, to compare the performance of ZINBA with other existing methods across a range of signal-to-noise ratios, patterns of enrichment, and proportion of total genomic enrichment. For example, CTCF ChIP-seq data exhibit punctate, high signal-to-noise ratio peaks, FAIRE-seq data have broader, low signal-to-noise ratio peaks, and RNA Pol II ChIP-seq data contain a mixture of punctate high signal-to-noise and diffuse low signal-to-noise peaks. H3K36me3 enrichment encompasses very broad domains of many kilobases, extending over large portions of transcribed regions. For each dataset, we applied the automated model selection tool to determine the set of component-specific covariates to model each dataset (Materials and methods).
ZINBA was compared with MACS [
5] and F-Seq [
2], which represent two classes of peak calling algorithms that also do not require an input control sample to call regions of enrichment. MACS [
5] represents a class of algorithms that uses a sliding window approach for the detection of enriched regions compared to a matching input control sample or local background estimate. F-Seq [
17] represents a class of algorithms that use kernel density estimation to estimate local read density and identifies enriched regions as those with a kernel density estimation larger than a user-defined threshold, which is estimated using simulations assuming random assortment of sample reads.
For each algorithm, the top N set of ranked peaks (500, 1,000, 2,000, and so on) were selected. The performance of each was evaluated by calculating the average peak length, the proportion of peaks overlapping a set of biologically significant features (within 150 bp) and the average distance to these features. For ZINBA, the set of unrefined peak calls (merged enriched windows) and refined peak calls (boundaries of punctate peaks within merged regions) were evaluated separately to determine their relative utility in each dataset. For the H3K36me3 data, we utilized the ZINBA 'broad' setting (Materials and methods) to capture regions of enrichment that may extend for many kilobases.
All algorithms perform comparably for the analysis of punctate high signal-to-noise datasets
For the CTCF ChIP-seq data set, the set of ranked peaks for each algorithm was compared to the occurrence of the CTCF motif (JASPAR motif MA0139.1). The genome-wide set of motifs was identified using FIMO, part of the MEME suite [
37], with default parameters. All of the algorithms were able to identify a high proportion of sites containing the CTCF motif (Figure ) and had comparable peak lengths (Figure ). Positioning of peaks called by ZINBA was slightly closer to the CTCF motifs (Figure ). These results are consistent with other comparisons of ChIP-seq peak calling algorithms [
17], which revealed few differences in sensitivity and specificity when applied to high signal-to-noise ChIP-seq data. Of the 50,228 refined peaks called by ZINBA, 95.2% were in common with MACS (60,135 peaks) and 99.9% were in common with F-seq (276,879 peaks).
The set of broad and punctate peaks identified by ZINBA for RNA Pol II ChIP-seq data reflects the elongation status of the polymerase
One unique feature of RNA Pol II ChIP-seq data is that enrichment consists of both punctate high signal-to-noise ratio peaks at transcription start sites (TSSs) and broader, low signal-to-noise peaks into the body of genes [
4]. All of the algorithms were able to capture a large proportion of annotated TSSs (Figure ; Figure S5a in Additional file
1). However, the set of refined peaks called by the shape detection algorithm within ZINBA resulted in a set of narrower peaks much more closely associated with the TSSs of genes (Figure ) compared with MACS, F-Seq, and unrefined ZINBA peak calls. A relatively high degree of overlap can be seen between each of the peak sets, although the overlap is not as strong compared to those observed for the CTCF dataset (Figure S5b in Additional file
1).
The ability to produce both a refined (punctate) and unrefined (broad) set of peak calls using ZINBA provides an opportunity to infer elongating versus stalled RNA Pol II. For the case of stalled RNA Pol II, one would expect a punctate peak at the TSS, but no broad peak within the body of the gene [
38]. Under this expectation, we computed a 'stalling score' (Materials and methods), where smaller values correspond to a broad high-amplitude signal across the gene, and larger values to a punctate signal near the 5' end of the gene and lower-amplitude signal along the gene body. Previous computations of RNA Pol II stalling scores utilized a height ratio between the punctate peak at the TSS and the median height of the broader region [
39] (Figure S6a in Additional file
1). Using ZINBA, our stalling score further incorporates the lengths of the broad and punctate enriched regions found in the experimental sample. The stalling index had a strong negative relationship (
P-value < 10
-10) to the expression of the nearby gene (Figure S6b in Additional file
1) and explained more of the variance in measured gene expression (R
2 = 3.5%) than a score utilizing only the ratio of punctate to broad signal height (R
2 = 0.04%). The ability to calculate this metric reflects one potential use of the peak boundary refinement module within the ZINBA framework.
ZINBA accurately identifies regions of enrichment in low signal-to-noise datasets without the use of input for background estimation
FAIRE-seq [
3,
40] differs from ChIP-seq in that it is an antibody-free method that recovers DNA fragments that are relatively resistant to formaldehyde crosslinking to proteins. The crosslinking profile of chromatin is likely dominated by histone-DNA interactions, and therefore the sites preferentially recovered by FAIRE correspond to sites of nucleosome depletion. On average the size of each FAIRE site corresponds to the loss of approximately one nucleosome (200 to 300 bp). Compared to the binding events identified for TFBSs by ChIP-seq, the FAIRE-seq sites tend to have much lower signal-to-noise, have a slightly broader pattern of enrichment, and encompass a larger proportion (1 to 2%) of the genome. In addition, input control is often not available. Therefore, many of the assumptions utilized by existing algorithms, especially for the analysis of TFBS ChIP-seq, are not well-suited to the analysis of this data type [
22].
We analyzed a K562 FAIRE-seq dataset lacking a matching input control sample with each algorithm, and compared the resulting set of peaks from each algorithm to a set of DNase I hypersensitivity sites (DHSs) [
31,
32] isolated from the exact same set of cells. The DHSs were called by F-seq, and were selected as a standard because of the longstanding use of DNase as a method for identification of open chromatin sites. Both ZINBA and MACS called a high proportion of FAIRE sites that overlapped a DHS, but a low proportion of FAIRE sites called by F-seq were localized to a DHS (Figure ). The set of sites called by both MACS and F-Seq tended to be longer and more errant in K562 CNV regions [
31,
32] (Figure S7a in Additional file
1), where approximately 37% of MACS and 27% of F-seq peaks were localized to a DHS, compared to 50% of ZINBA peaks. Overlap between called peak sets from ZINBA, MACS, and F-seq for FAIRE were more disparate than those found in high signal-to noise CTCF data (Figure S7b in Additional file
1).
Open chromatin regions tend to have strong correspondence to active regulatory elements and promoter regions of expressed genes [
40]. Comparison of the set of ZINBA RNA Pol II and FAIRE-seq refined peak calls yielded a significantly higher degree of overlap compared to the other algorithms (Figure ), indicating consistency in ZINBA peak calls across data types.
ZINBA captures broad patterns of enrichment
The deposition of H3K36me3 is mediated by enzymes that travel along with RNA Pol II during transcriptional elongation, and therefore this histone modification typically occurs in broad segments encompassing a large proportion of gene bodies [
41]. Utilizing the 'broad' ZINBA setting (Materials and methods), the H3K36me3-enriched regions identified by ZINBA correspond to the broad patterns of enrichment covering actively transcribed gene bodies, as expected.
On average, 80% of the lengths of the top 'N' most active UCSC gene bodies were covered by the set of H3K36me3 ZINBA peaks (Materials and methods; Figure ). A lower level of gene body coverage was found from other methods. Of the 40,180 H3k36me3 merged ZINBA peaks, 71% overlap a gene body, compared with only 59% of F-seq peaks merged in a similar fashion, suggesting higher specificity of these broad ZINBA regions to gene bodies. Of the set of ZINBA merged peak calls that overlapped a gene body, the median and 75th percentile of peak lengths was 5,374 and 18,370 bp respectively, indicative of the broader set of features that are being called (Figure S8 in Additional file
1).
Within the set of H3K36me3 enrichment regions identified by ZINBA, those that overlap ZINBA RNA Pol II broad regions also contain significantly higher levels of RNA expression compared to those that do not overlap broad RNA Pol II regions (Figure ). Approximately 85% of ZINBA H3K36me3 broad regions that overlap a ZINBA RNA Pol II broad region contain non-zero RNA-seq signal (7,585 out of 8,873 overlapping regions), compared to only 58% of those that do not (18,134 out of 31,312 non-overlapping regions). Furthermore, of ZINBA H3K36me3 regions with non-zero RNA-seq signal, those that overlapped a ZINBA RNA Pol II broad region had three-fold higher median RNA expression. The relationships we observe among our ZINBA calls recapitulates the biology of H3K36me3, where higher levels RNA Pol II activity correspond to higher levels of RNA transcription and histone modification (Figure ).
ZINBA performs comparably with or without input control data
Comparison of ZINBA peak calls from BIC-selected models considering input as a covariate versus those that do not reveal similar performance in isolating relevant enriched regions. For example, 94% of the CTCF ChIP-seq peaks discovered using a model that included input (Table S1 in Additional file
1) were held in common with a model considering only G/C content, mappability score, and the local background estimate as starting covariates. Recovery of sites overlapping a CTCF motif was also very similar (Figure S9a in Additional file
1). This similarity in performance with and without input extended to the lower signal-to-noise H3K36me3 ChIP-seq data (Figure S9b in Additional file
1). Because of the broad nature of H3K36me3 enrichment, we only considered G/C content and the mappability score as potential covariates in the no-input model. These results demonstrate the ability of ZINBA to distinguish regions of enrichment from background in the absence of input control.
Modeling enrichment covariates is especially beneficial in low signal-to-noise data
Choosing not to model covariates in the enrichment component (Table S1 in Additional file
1) resulted in almost uniform decreases in model confidence in the classification of 'enriched' windows relative to when enrichment covariates are considered (Figure S10 in Additional file
1). This is especially severe in the low signal-to-noise FAIRE and H3K36me3 dataset (Figure S10a, b in Additional file
1), in contrast to the higher signal-to-noise CTCF data (Figure S10c Additional file
1). In H3K36me3 data, significantly fewer windows in chromosome 22 were classified as enriched over background (posterior probability of enrichment greater than 0.5) when enrichment covariates are ignored. Applying this model genome-wide, we find that 60% fewer windows were called at the default threshold prior to window merging, and post-merging we observe much lower coverage of active gene bodies (Figure S9a in Additional file
1), in contrast to CTCF peaks, which change little as a result of ignoring enrichment covariates (Figure S9b in Additional file
1). These results and the simulated data suggest that utilizing covariates provide an increased discriminatory power for distinguishing background and enriched regions, especially in low signal-to-noise data or when information such as an input control is lacking.