Results of the application of RSSPA for detection of histone acetylation and RNA polymerase II occupancy sites are described here. The experiments performed in the HL-60 – an acute myeloid leukemia – cell-line, explore the interaction of DNA with (i) tetra-acetylated histone (HisH4), and (ii) RNA pol II [
35]. HL60 is stimulated with all-trans-retinoic acid for distinct time periods, to induce differentiation along the granulocytic lineage. 0, 2, 8 and 32 hrs constitute the time-course in this experimental design. The site prediction algorithm is applied per time-point and differential modification analyses (data not presented here) are performed subsequently. HisH4 is a histone modification factor associated with active genes. RNA pol II is the nuclear RNA polymerase responsible for mRNA transcription. In eukaryotes, unlike prokaryotes, the normal or ground state of the chromatin is restrictive to transcription. In the repressed state the enhancer and promoter elements are covered by nucleosomes. This state can be converted, via acetylation, methylation and recruitment of chromatin remodeling factors, into a transcriptionally poised state that is prepared for binding to RNA pol II and TFIID proteins [
36]. Both HisH4 and RNA pol II are hybridized to Affymetrix ENCODE [
26,
27] tiling arrays of 22 bp (average) probe resolution and 10μ feature resolution. The ENCODE array samples approximately 1% of the human genome and does not include regions from chromosomes 3 and 17.
RSSPA has been implemented for detection, ranking and segmentation of enrichment sites across a spectrum of ChIP on chip experiments. These range from chromatin remodeling factor (Brg1), sequence-specific DNA binding proteins (CTCF, CEBP/ε), histone modification factors (HisH4: acetylation, H3K9K14D, H3K27T: methylation), to factors with known 5' end biases (RNA pol II, TFIID). Data on the above factors are being published as part of the ENCODE Consortium effort [
37,
38]. The HisH4 and RNA pol II data, a subset of the ENCODE data, are discussed here since they represent contrasting enrichment profiles – in terms of the base pair coverage of their binding footprints. The results will focus on:
i) Common parametric approaches for detection of enrichment sites
ii) The underlying mechanism and efficacy of the proposed non-parametric RSSPA
iii) Simulation results
iv) Biological examples
v) Validation results derived from alternative computational and biochemical approaches
Binary segmentation for detection of enrichment sites
This simple, intuitive approach validated via quantitative PCR (qPCR), is effective in the identification of sites with relatively strong enrichment signal and statistical confidence. While ensuring low false positive rates, this method can suffer from a significant false negative bias, especially in regions of diminished signal enrichment. A span of DNA sequence is computationally labeled a non-site (negative) if it fails the stipulated signal or p-value threshold. This binary outcome does not reveal whether the absence of a site is due to a potential false negative caused by failing the threshold by a minor margin, or is a true negative caused by a span of DNA with very low significance and negligible IP enrichment.
The intrinsic noise in ChIP on chip can reduce the SNR in the data and result in a globally lower p-value and/or signal enrichment distributions. In these circumstances, a computationally determined negative might be positively validated by qPCR and/or other biochemical means. Thus in experiments with high variance and/or reduced binding efficiency, a significant false negative bias can be introduced, resulting in inflated discordance across replicate experiments, as demonstrated by Fig. . The figure shows, for three replicates, the pScore profile across 43 contiguous positive probes that constitute a putative site. At this site only two thirds of the biological replicates exceed the pScore threshold of 50(Wilcoxon p-value = 10-5) for a set of contiguous probes, potentially indicating variable levels of sensitivity in the experiments. The overall pScore trend is consistent across replicates, hinting at the presence of a putative site. However, if the replicate with the lowest pScore distribution (blue curve) were the only dataset available and the binary segmentation with a pScore threshold of 40, the method of choice then this site would become a false negative. In the absence of independent corroboration it would be difficult to discriminate the putative target (in the lowest sensitivity experiment) from an artifact of spatial auto-correlation.
Threshold estimation controls the sensitivity versus specificity of the binary segmentation approach. Each of the thresholding mechanisms – whether of fixed or distributional type – introduces a different type of bias to the analysis, examples of which are shown in Fig. . The figure shows box-plots of the pScore distribution of three biological replicates (B1-B3) and their replicate composite, following binary segmentation. Fig. (upper panel) contrasts site detection based on a fixed versus distributional pScore threshold derived from replicate B1. The two pScore thresholds used are: (a) fixed: 50 (pScore ≥ 50); and (b) distributional: 99th percentile of the B1 distribution. Once the distributional threshold is derived from a particular replicate, the same value is applied uniformly across all replicates. The binary segmentation after fixed thresholding show that the sites in B1 have a minimum score of 50 (as expected), while the pScore distribution of these exact sites in replicates B2, B3, and the composite, manifest a significant range from 0–200. For the distribution-derived threshold, the detected sites have a pScore range of 170–230(B1), 0–50 (B2-B3), and 50–150 (composite). This result highlights disparity in the score-distribution across replicate biological samples and demonstrates that fixed and distribution-derived thresholds might not detect identical sites across replicates, resulting in increased disparity among replicate experiments. An alternative to using individual replicates is to generate the fixed/distributional thresholds based on the replicate composite. But, as shown in Fig (lower panel), this hardly mitigates the disparity. While the choice of a composite over an individual replicate does not improve the performance of the method, the choice of a distributional over a fixed threshold reduces the variance in the pScore distribution of the putative sites. This is evident from the maximal compression of the inter-quartile range observed in Fig (top-right panel).
Rank statistics based algorithm for detection of enrichment sites
A parametric binary segmentation paradigm has the potential to introduce a significant false negative bias. This bias discriminates against sites with moderate to low binding-enrichment, or poor probe behavior. RSSPA employs a rank and replicate statistics-based paradigm to mitigate these biases. The following sections discuss the results from each of the components of RSSPA.
Step 1 – Seeding of sites based on binary segmentation of data
Results have been generated based on both seeding parameters – p-value and SE. The site seeding is potentially more robust if based on signal enrichment, rather than p-value distribution. Since the latter is affected more significantly by spatial auto-correlation. The final results summarize the correlation obtained between the two seeding parameters. Fig. shows an example of site-seeding, based on a pScore threshold of 20, across five biological replicates in the HisH4 data. The amplitude of the graphs (blue) represents the pScore distribution in a specific genomic region where all replicates manifest similar ChIP-enrichment (all graphs have been scaled to common maximum for clarity of visual representation). The top-most track (green) represents the union of the site intervals as derived from the individual replicates.
Step 2 – Optimization of sites based on centrality, variance and error distributions
The optimization based on the simultaneous minimization of the p-value based covariates – μ, SAD and ε – and maximization of SE forms the corner-stone of the algorithm (Steps 2–3). Fig. summarizes the rank consistency distribution for putative sites. It demonstrates density plots of pair-wise rank-difference distributions for replicate data derived from chromosomal segments with contrasting gene density. Poor gene density (chromosome 1) is shown at left, higher gene density (chromosome 10) is shown at right. Based on triplicate experiments, the absolute rank-difference distribution is computed for all six pair-wise replicates – {Bi, Bj} and {C,Bi} – where, i and j refer to replicates, l ≥ i(j) ≥ 3 and i ≠ j, and the replicate composite (C). Data for the C-B1 pair is shown in black, C-B2 in blue, C-B3 in red, B1-B2 in brown, B1-B3 in cyan and B2-B3 in magenta. For visual clarity, the x-axis is scaled by 100, data-points are shown for one curve (C-B2) and for others a spline-fit to the data has been represented. The primary observations here are:
i) The absolute rank-difference curves do not trace a delta function about 0 (ideal case) or even an exponential decay with a peak at 0, but rather a gamma distribution with the mode slightly greater than 0. This observation holds true across all chromosomal regions. The off-zero mode indicates that very few sites show perfect rank consistency. This is due to inherent noise in the ChIP on chip process. However, the bulk of the population of seed sites maintains very high rank consistency across replicates. This validates a fundamental assumption of the model.
ii) The skewness of the distribution is attributable to the population of sites whose rank order correlation across replicates gradually diminishes. More than 95 percent of these sites have a high likelihood of inherently poor intra-replicate ranking (data not shown).
iii) The SAD distribution is estimated based on seeded sites which include all possible ChIP enrichment intervals that exceed a SNR of 1.1. Potentially a high percentage of these sites could be false discoveries. The FDR could be reduced by re-adjusting the parameters of the normal distribution (N(μ,σ2)) based on the underlying gamma distribution:
(a) Setting the estimated mean (μ) to the mode of the gamma distribution;
(b) Estimating the variance (σ) by symmetrizing the left tail of the gamma distribution.
The proposed modification in the estimation of the normal distribution would filter out sites with high SAD values. In order to explore the response of RSSPA to various noise sources, results presented here do not incorporate this correction.
iv) Independent of gene density – as observed from data across both panels – there is a strong correlation (R2 ≥ 0.87) across all pair-wise absolute rank difference distributions considered. The contrasting gene density data demonstrates that the correlation in the rank-difference profiles is maximal in the gene poor regions (R2 ≥ 0.94). The reduced correlation in the gene rich regions is potentially due to the fact that the variable sensitivity in ChIP on chip experiments has maximal impact here. Overall, the rank order preservation in sites is strongest for the pair-wise combination of C-B1. This observation reflects the fact that the pseudo-median replicate composite distribution is dominated by the B1 replicate profile, which is the experiment with highest sensitivity.
Step 3: Final segmentation of sites based on a stringent signal enrichment threshold
RSSPA optimizes site-detection based on simultaneous minimization of λ and maximization of SE, demonstrated in Fig. . Sites in the λ distribution, that manifest at least two-fold immunoprecipitated enrichment in at least one of the replicates, are considered candidate sites, λs, for the final ranking and segmentation. The two-fold IP enrichment threshold is assessed based on the lowest detection limits of qPCR. Sites with the highest enrichment generally populate the first quartile of the λ distribution (Fig. ). qPCR validation results discussed below show that a two-fold array enrichment threshold is indeed stringent, since the dynamic range of a microarray measurement is compressed in comparison to that of qPCR measurements. The stringency of the SE threshold is user tunable and can be altogether omitted, depending upon the specificity required. The algorithm performance subsequent to the application of the above-mentioned filters has been discussed under method validation.
Simulation results
The simulation results show a progression of RSSPA response obtained from data with very high SNR to data with artificially introduced noise (lower SNR). Fig. shows a simulation result for 100 sites, derived from quadruplicate datasets with significant reproducibility – Spearman's ρ of approximately 0.93. In this simulation the sites are seeded based on the p-value distribution. The three axes of the figure represent the covariates: μ, (x-axis), ε or errval (y-axis) and SAD (z-axis). Following the minimization optimization, a high-density cluster is observed around the vertex (μ = 0, SAD = 0, ε = 0). The vertex cluster represents the sites with maximal intra-replicate ranks and inter-replicate rank consistency and highest statistical confidence. Since the dataset is simulated for a high SNR condition, the diminishing cluster density away from the minima is expected.
In order to test the efficacy of the algorithm, variable levels of outliers are simulated by the introduction of correlated noise in the data. The results show that monitoring of inter-cluster and intra-cluster metrics allow users to dynamically assess reproducibility across replicate experiments. With decreasing SNR the cluster density migrates from the minima (vertex) to the top right where the errors on covariates are maximized. Experiments with highest reproducibility result in a vertex cluster with maximum density and minimum intra-cluster variance. Based on analysis of several ChIP factors, it has been determined that for experiments with greater than one cluster, vertex clusters with a density of 90 percent or higher (90 percent or more of all sites detected occupy the vertex cluster) and maximum vertex cluster radius of less than 1.5 times vertex-cluster standard deviation, manifest a Spearman's ρ of greater than 0.92 and generally have a FDR of less than or equal to 5 percent. By maintaining the number of replicates constant and varying reproducibility, there occurs a migration of sites away from the vertex cluster and an increase in the intra-cluster variance. This highlights the consequence of reduced reproducibility in a ChIP on chip experiment. This class of vertex-cluster sites can provide anchor points for experimentalists to perform further biological validation including qPCR to investigate the dynamics of transcriptional regulation. A study of change in the membership of the vertex cluster, in a time course experiment, is a powerful tool to probe the differential changes in cells subject to external stimuli.
Biological examples
The first set of results is presented for the occupancy of RNA pol II as determined in the HL60 cell line. The experimental design comprises five biological replicates, each with a single technical replicate (5 × 1), yielding a total of ten datasets across IP (five replicates) and control (five replicates). Wilcoxon-p-value and a HL-based signal estimate distributions using a sliding window of 1 kb are computed per replicate. The results of RSSPA for five replicate pairs are shown in Fig. (top and center panels). Each plot shows three axes representative of μ (x-axis), ε (y-axis) and SAD (z-axis). The color map represents the gradient of the site distribution based on λ (left panels) and SE (right panels), with the maximum and minimum denoted by blue and red respectively. Sites with lowest FDR occupy the minimum end of the λ spectrum (left panels) and the maximum end of the SE (right panels) spectrum. The primary observations here are:
i) Site distribution is along a continuum, rather than in isolated clusters.
ii) The ranking distributions in λ and SE show an overall strong correlation of Spearman's ρ of approximately 0.812. However, there is also a distinctly aberrant cluster (indicated by the arrow in the top right panel).
The following are two potential explanations of the above observations. First – while the replicates are not perfect – as evident from outliers in the λ distribution – their overall distributions are relatively similar. If the inter-replicate distributions were significantly different they would separate into discrete clusters highlighting the concordance across some and discordance across others. Second – in order to understand the origin of the aberrant cluster, the continuum is segmented into percentiles based on SE. A Euclidean distance metric is computed across the medoids of the percentile bins; this localizes the aberrant cluster. The aberrant cluster has maximal similarity to the cluster with lowest SE but its p-value-based metrics ascribe it a higher statistical confidence. This may be an artifact of auto-correlation contamination of the p-value. This cluster is eliminated (bottom panels) by applying the stringent overall SE filter of median(SETMs) > 0.693/R (R: maximum number of replicates). Following this elimination, the Spearman's ρ correlation between SE and λ improves from 0.812 to 0.975. The power of this approach is that it enables extraction of sub-optimal sites, albeit with a lower consistency score, whose presence might not be reproducible across replicates.
The following analysis highlights the impact of reproducibility across replicate experiments on the assessment of ChIP-enriched sites. In Fig. (bottom panel), data for three out of five replicates in RNA pol II is summarized. The replicates are chosen on the basis of least pair-wise reproducibility. The plot shows the components of the λ distribution along the three axes: μ (x-axis), ε (y-axis) and SAD (z-axis). The color-map shows the ranking based on SE with blue and red corresponding to the maximum and minimum respectively. The primary observations here are:
i) Unlike the prior result, where the site distribution followed a continuum, here there are three distinct clusters generated primarily in response to the discordance across replicates.
(a) Cluster I is the vertex cluster with μ, SAD and ε approaching 0.
(b) Cluster II represents sites with ranks in the inter-quartile range of each replicate and whose rank order may be discordant in a subset (1/3 or 2/3) of replicates.
(c) Cluster III, the most distal one, reflects sites with ranks in the uppermost quartile within each replicate. These sites are those with maximum discordance in rank order across replicates. The discordance in the rank ordering is potentially introduced by variability in sensitivity, for example, the case where a replicate IP array is significantly more sensitive than the other two. This variability in sensitivity has maximal impact on chromatin modification sites that have an overall lower level of expression of modification.
ii) Increasing intra-cluster dispersion is observed in the more distal clusters.
iii) The segmentation based on ranked p-value metrics has > 99 percent concordance with the segmentation based on ranked signal enrichment, as shown via the color-map. This indicates that there is internal agreement for the ordering of sites based on the three p-value-centric covariates as well as with the ordering based on SE.
Fig. (top panel) shows an example region contrasting the array signal (top two tracks) and p-value enrichment (bottom two tracks) profiles for RNA pol II. Data from two biological replicates (generated with respect to amplified and non-amplified inputs) at the 00 hr are shown. The x-axis represents the genomic coordinate and the y-axes represent the amplitude of SE and pScore, with tracks scaled to their respective distribution bounds to facilitate visual data comparison. In this panel the distance between tick marks on the x-axis is l0 kbp. The observations from the data are the following:
i) Visually, there exists a strong concordance between the pScore and SE distributions. For the replicates the concordance ranges between 0.97–0.99(data not shown).
ii) There is a strong concordance between the putative sites independent of whether the seeding is based on pScore or SE. Based on analysis across all five replicates an 89 percent bp intersection is observed between putative sites seeded by pScore or SE.
iii) Two distinct sites (blue and red) are observed here. One of the sites overlap with the 5' end of the IFNAR1 gene and the other is approximately 3000 bp downstream of the 3'end. The span of the sites range from 800–1100 bp.
HisH4 with a broader distribution across the interrogated parts of the genome presents a significantly different enrichment footprint in comparison to RNA pol II. The acetylation regions often span 1 kb-long (or longer) genomic sequences, and are frequently observed as enrichment plateaus rather than as peaks. Fig. (bottom panel) shows an example region contrasting the p-value enrichment profile of HisH4 in three biological replicates (B1-B3) at the 00 hr. The x-axis is representative of the genomic coordinate and the y-axis or amplitude of the graphs is representative of pScore with tracks scaled to the same bounds to facilitate visual data comparison. Also in this panel the distance between tick marks on the x-axis is 1 kbp. All three replicates show evidence of a pair of ChIP-enriched sites, one slightly upstream and the other overlapping with the 5'end of the DOLPP1 gene on chromosome 9. The putative site upstream of the annotation has a footprint of 750–850 bp, whereas the one overlapping with the 5'end has a footprint of 1800–2000 bp. It is conceivable that the fragmentation in the site is due to the presence of interspersed repeat sequences that have not been tiled; hence in truth it is a single site spanning over 3 kb. Nonetheless, the positive probes contributing to the creation of the site exhibit a highly concordant ChIP enrichment profile over a significant span of the sequence. This represents a footprint very different from sequence specific factors where motif identification might be stronger predictors of target sites. HisH4 exhibits a basal level of acetylation, with plateaus rising above the baseline, representing longer periods of persistence in an acetylated state. This is in contrast is to RNA pol II potentially because the binding occupancy of RNA pol II emulates a switch with two discrete states – bound and unbound. The efficacy of RSSPA for detecting enrichment profiles, irrespective of a site's span, is discussed in the method validation. RSSPA's independence to the span of binding activity is mainly because the model is not shape-based; instead, it utilizes the concordant behavior within a neighborhood of contiguous probes, and consistency of probe behavior across replicate experiments.
Site predictions for both factors were segmented at 1, 5 and 10 percent FDR. The median overlap of predicted sites of RNA pol II occupancy and HisH4 acetylation at 1 and 10 percent FDR was 42.7 and 50.89 percent respectively. The median is generated from the time-course experimental dataset (time-points of 0, 2, 8 and 32 hours). The standard deviation in the overlap across the time-course is 7 percent. These observations indicate that, despite the differences in the enrichment profiles, there is significant recapitulation of sites across both factors – this in itself is a validation of the performance of RSPPA. Approximately 84 percent of the site overlap that occurs is significant at 1 percent FDR. The qPCR validation results (below) show significant concordance with site prediction at the level of 5 percent FDR.
Method validation
Two types of validation data are presented. In the first type, the performance characterization of RSSPA is based upon statistical techniques and in the second type it is based upon validation with respect to qPCR. In the statistical approaches, knowledge of RNA pol II enrichment for the 5' ends of transcripts has been utilized. The significance of the overlap of RNA pol II sites with the 5'ends of known annotation (RefSeq and VEGA) is estimated via boot-strapping. The significance of the 5' enrichment is computed to be p < 0.000l. Fig. shows the overlap of a subset of RNA pol II sites, of span 500–1400 bp, with the 5'ends of known annotation. The x-axis refers to genomic coordinates with the four tracks representing annotation. The track in red is representative of the predicted site intervals; the tracks in blue are representative of RefSeq annotations along the sense and anti-sense strands indicated by RefSeq(+) and RefSeq(-) respectively. The RefSeq(+) track is aligned along the 5' to 3' direction with the RefSeq(-) track being vice-versa; the track in green is representative of the coverage of the ENCODE region on the array. This particular visualization shows RNA pol II sites overlapping the 5'end of transcripts – PIP5K1A, PSMD4 – on the sense strand and TCFL(alias: VPS72), PIK4CB on the anti-sense strand. Sites are also found internal to transcripts, and in intergenic space (data not shown). The presence of these sites can be validated with qPCR, but ascertaining their biological significance – whether they hint at poised or paused states of RNA pol II transcription machinery – requires further biological experimentation.
A pseudo receiver operating characteristic (ROC) curve method [
39,
40] has also been employed to characterize the performance of RSSPA. In the absence of a gold standard, for this analysis the positive regions are derived from the known 5' ends (RefSeq) ± 500 bp of the first exon(and UTR). The negative regions are derived from intergenic regions as well as the inner-most introns of transcripts provided their bounds do not overlap with the positive regions. Since RNA pol II occupancy sites, internal to transcripts and/or in intergenic space have been validated by qPCR the above definition of the negative regions might include true positive sites. Similarly, the definition of positive regions might include true negatives. Therefore, there is some degree of contamination in the delineation of the positive and negative regions, hence the pseudo nature of this analysis. The pseudo-ROC curves provide an accurate estimation of the relative performances of the various algorithms. Fig. compares the performance of RSSPA versus binary segmentation via ROC curves; the x-axis corresponds to the FPR (1 – specificity) and the y-axis corresponds to sensitivity. The solid curves are representative of RNA pol II data derived from chromosome 1 as sampled by the ENCODE array. The performance of RSSPA (orange) has been contrasted with that of binary segmentation using pScore thresholds of 50 (blue), 40 (red) and 30 (magenta). At low FPR (≤ 0.1) a greater than 4x improvement in sensitivity is observed in the RSSPA over the latter. As expected, within the binary segmentation approaches, the sensitivity improves with increased pScore threshold; this comes at a price of increased false negative rate. The dotted curve is representative of RNA pol II data derived from all chromosomes sampled by the ENCODE array. Use of the partial area under the curve (pAUC) [
41] metric yields a recovery of approximately 83.6 and 95.09 percent of
true positive occupancy sites at 1 and 5 percent FPR respectively.
Fig. , summarizes the HisH4 qPCR validation data. A random sampling of n = 72 RSSPA-predicted sites comprising the entirety of the rank spectrum is validated by qPCR. Negative controls are designed from regions on the array which are not predicted as sites. These controls establish a baseline to ascertain whether a site's qPCR enrichment is positive and hence determine the sensitivity and specificity of the proposed algorithm. Fig. illustrates the association of a qPCR based FPR with the statistical significance estimated by RSSPA. This in effect is a biological characterization of the performance of the algorithm. The plot represents the percentage of predicted acetylation sites that are validated by qPCR (y-axis) as a function of the meta p-value (y-axis). The qPCR based true positive rate is estimated by considering the number of validated sites as a fraction of the total number of sampled sites at a given pScore threshold. The range of the tested pScore is from 0–390. A pScore threshold of approximately 56 corresponds to a 95 percent FPR. Aside from providing a methodology to characterize the performance of the algorithm, this also enables an experimentalist to generate an initial ranked list of enrichment sites then further segment sites based on an experimentally derived pScore threshold. Fig. (top panel) delves into the discussion of how true positives are determined based on qPCR validation. qPCR enrichment values for the 72 sites are shown (top panel) and thresholds are set based on 5σ (yellow) or 10σ (red) above the mean of the negative and non-sites(m = 3). Sites with enrichment above these thresholds are considered true positives. Fig. (center panel), summarizes the data discussed in Fig , and further classifies sites into two groups based on whether they meet a 0.2 SE threshold (red) or do not (blue). The primary conclusions here are:
(i) There is a trend of positive correlation between p-value and qPCR enrichment, with R2 = 0.45 (data not shown); the reduced concordance is partially attributable to the fact that the qPCR experiments are conducted using an un-amplified sample whereas the arrays are hybridized to an amplified sample.
(ii) With increasing p-value, there is a higher percentage of overlap with sites that pass the stipulated array enrichment threshold. While true positives are observed in the lower p-value bins, there is a higher degree of contamination due to auto-correlation artifacts, the majority of which are eliminated by the SE filter. Additionally, with the exception of a few outliers, the higher order p-value bins tend to have higher qPCR enrichments.
The primary conclusion therefore, is that in order to achieve a low percent FDR in site prediction, both p-value and SE thresholds need to be employed. A five percent FDR is optimal for most factors studied here (data not shown). The sensitivities (Fig : bottom panel) obtained subsequent to the segmentation of data using the filters: (a) meta p-value of 10-5 (b) array signal enrichment of 0.2 (c) the composite of (a) and (b) are 88%, 87% and 95% respectively.