The workflow of MAnorm is summarized in Figure . First, four bed
files that describe the coordinates of all predefined peaks and aligned sequence reads
of two ChIP-Seq samples are used as input. Second, MAnorm calculates the number of reads
in a window of the same length centered at the summit of each peak. Here the window size
should be comparable to the median length of ChIP-enriched regions; we recommend 2,000
bp window size for histone modifications and 1,000 bp for transcription factor binding
sites. The (M, A) value of each peak is then defined as:
and:
Here,
R1 is the read density at this peak region in ChIP-Seq sample
1 and
R2 is the corresponding read density in sample 2. To avoid
log
20, we added a value of 1 to the real number of reads for all peaks.
Thus, the value of
M describes the log
2 fold change of the read
density at a peak region between two samples, while
A represents the average
signal intensity in terms of log
2-transformed read density. To build the
normalization model, each peak of the two samples being compared was further classified
as a common or a unique peak, depending on whether or not it overlapped (by at least one
nucleotide, as implemented in our analysis in this study) with any peak in the other
sample. The downloadable MATLAB MAnorm package (Additional file
1) also provides a parameter for users to select common peaks based on a
cutoff of peak summit-to-summit distance. By default, this value is set to 500 bp for
histone modifications and 250 bp for transcription factors. In addition, when a peak
overlaps with multiple peaks in the other sample, MAnorm selects the peak with the
smallest summit-to-summit distance to avoid potential bias in building the normalization
model. Next, robust regression was applied to the
M-
A values of common
peaks using iterative re-weighted least squares with a bi-square weighting function
[
33] and a linear model was derived to fit
the global dependence between the
M-A values of these peaks:
To normalize the (
M, A) values of all peaks, MAnorm performed coordinate
transformation to make the
A axis overlap with the linear model derived from
regression. The corresponding (
M, A) value under the new coordinate system was
then taken as the normalized (
M, A) value of each peak. Finally, a
P-value associated with each peak was calculated to quantify the significance
of differential binding at this locus using a Bayesian model developed by Audic and
Claverie [
16]:
in which x and y specify the normalized read count at this peak in sample 1 and sample
2, respectively. Additional file
3 provides further details on
P-value calculations. When the read densities at most peak regions are high,
most peaks associated with absolute
M values > 1 are associated with
significant
P-values. Then, the
M value can be used to rank peaks and
select differential binding regions, as was done in analyzing ENCODE ChIP-Seq data
(Supplementary Table 1 in Additional file
4). When read
densities at most peak regions are relatively low, some of the peaks associated with
absolute
M values > 1 may still fail to obtain significant
P-values.
In such a case, we suggest ranking peaks by
P-values and defining differential
binding regions using combined cutoffs of both
M value and
P-value, as
we did when analyzing the ChIP-seq data from Taslim
et. al. [
12] (Supplementary Table 2 in Additional file
4).
The output of MAnorm includes the normalized (
M, A) value and the corresponding
P-value of each peak. To illustrate the normalization process, the (
M,
A) values of all peaks before and after normalization are plotted together with
the linear model derived from common peaks. The MAnorm package will also generate three
bed files presenting the genome coordinates for the non-differential binding region and
two differential binding regions based on user-specified cutoffs, together with two wig
files (corresponding to the two peak lists under comparison) that can be uploaded to a
genome browser for visualization of the
M value for each peak (Supplementary
Figure 9). MATLAB and R versions of the MAnorm package are available for downloading in
Additional file
1.
Application of MAnorm to ENCODE ChIP-Seq data
The performance of MAnorm was tested using ENCODE ChIP-Seq data describing histone
modifications (H3K4me3 and H3K27ac) [
34] and
transcription factor binding (c-Myc and Pol II) [
35] across three human cell lines: H1 ES cells, HeLaS3 cells, and
K562 cells [
36]. Since these data were
generated and processed by different laboratories associated with the ENCODE project,
the data sets were reanalyzed and the ChIP-Seq peaks in each sample were redefined
using MACS [
4] using a
P-value cutoff
of 1e-10 for histone modifications and a
P-value cutoff of 1e-6 for
transcription factor binding. The peaks of histone modifications were further
filtered by the false discovery rate (FDR) values modeled by MACS. The target genes
of each group of peaks were defined as those RefSeq genes that have a given peak(s)
in the promoter region, defined as the region from 8 kb upstream to 2 kb downstream
of the transcription start site.
Gene expression data for all three cell types were collected from the Gene Expression
Omnibus (GEO) database using accession numbers [GEO:GSE26312] (for H1 ES cells)
[
29], [GEO:GSE2735] (for HeLaS3 cells)
[
37] and [GEO:GSE12056] (for K562
cells) [
38], and the raw data were
reprocessed by dChip [
39]. The differentially
expressed genes were subsequently identified by Significance Analysis of Microarrays
(SAM) [
40] using a combined cutoff of fold
change > 2 and FDR < 0.01. In total, 3,465 genes more highly expressed in H1 ES
cells and 2,224 genes more highly expressed in K562 cells were identified from the H1
ES to K562 comparison; 5,815 genes more highly expressed in H1 ES cells and 1,649
genes more highly expressed in HeLaS3 cells were identified from the H1 ES cell to
HeLaS3 cell comparison; and 3,555 genes more highly expressed in HeLaS3 cells and
5,916 genes more highly expressed in K562 cells were identified from the HeLaS3 cell
to K562 cell comparison. To study the relationship between binding differences in
peak regions and the expression change of the corresponding target genes, we used the
M values of peaks to divide the targeted genes into different groups
separated by integer
M values from -4 to 4, and then calculated the
enrichment score of the overlap between each gene group and those differentially
expressed genes. To avoid extreme enrichment scores, groups composed of < 50 genes
were merged with the larger of the adjacent two gene groups.
Motif scan and hierarchical clustering of motif scores with peak M
value
To detect the potential binding of transcription factors in defined peak regions, we
downloaded the position weight matrixes of 130 core vertebrate motifs from the JASPAR
database [
41] and performed motif scan
[
42] applied to a 1,000 bp window
centered at the peak summit. For each motif
F, the raw motif matching score
at each peak
P was calculated as:
in which S is a sequence fragment of the same length as the motif and B
is the background frequency of different nucleotides estimated from 10,000
random 1,000 bp sequences sampled from the genome. The motif score of motif M
in peak P was defined as the raw motif matching score divided by the
maximum possible score, that is, the raw motif score obtained by the consensus
sequence of the motif.
To identify transcription factors associated with cell type-specific binding of the
ChIP'd proteins, we applied hierarchical clustering with Ward's linkage to cluster
the M value with the motif matching score of JASPAR motifs in all peaks of
cell type 1, and separately the -M value was clustered with the motif scores
in all peaks of cell type 2, using '1 - ρ' as the distance metric,
where ρ is the Pearson correlation coefficient. Only motifs with an
enrichment score > 1.2 and Bonferroni-corrected P-value < 1.0E-5 by
Fisher exact test are shown in the clustering plots.
Comparing the performance of MAnorm and other methods
For total read normalization, we divided the read intensity of each peak region by
the total number of mapped sequence reads. For quantile normalization, we first
divided the whole genome into non-overlapping bins of the same size as the window
used in MAnorm (2,000 bp for H3K27ac), and then calculated the read count in each
bin. Finally, the distribution of bin read counts was normalized to be the same by
matching all quantiles between samples. For normalization by genome-wide MA plots, we
first divided the whole genome into non-overlapping bins of the same size as the
window used in MAnorm (2,000 bp for H3K27ac), and then calculated the M-A
value of each bin. The dependence between M-A value was then
removed by subtracting M values with local linear model fitted by LOWESS
regression from the genome-wide M-A values.
To compare the performance of MAnorm with the model developed by Taslim
et al.
[
12], we used MACS to identify
peaks from the same Pol II ChIP-Seq datasets used by [
12], and then applied MAnorm to compare Pol II binding
profiles between estradiol (E2)-stimulated MCF7 cells and unstimulated MCF7 cells.
The gene expression data of unstimulated and E2-stimulated MCF7 cells was obtained
from the GEO database, accession number [GEO:GSE11352] [
43]. We identified 59 genes showing higher expression in
unstimulated MCF7 cells and 130 genes showing higher expression in E2-stimulated (12
h) MCF7 cells using SAM with fold change > 2 and FDR < 0.1. Finally, the
performance of MAnorm was evaluated by comparing the difference of Pol II binding
determined by both models with the differential expression of target genes.