Example 1: Finding the nearest gene and the distance to the transcription start site of the nearest gene
The output from ChIP-seq or ChIP-chip analysis is a list of binding sites (as chromosome locations) that are significantly enriched in the ChIP sample(s) compared with the corresponding control sample(s). The example below details how to find the nearest gene and the distance to the transcription start site (TSS) of the nearest gene in the human genome for a list of binding sites (named myPeakList) of type RangedData. The distance is calculated as the distance between the start of the binding site and the TSS, which is the gene start for genes located on the forward strand and the gene end for genes located on the reverse strand.
The first step is to load the ChIPpeakAnno
package, an example dataset and an annotation dataset. In this example, the example dataset contains putative STAT1-binding regions identified in un-stimulated cells [2
], and the annotation dataset contains the TSS coordinates and strand information from human GRCh37.
In the next step, the function a nnotatePeakInBatch is called to find the gene with nearest TSS or overlapping gene that is not the nearest TSS and corresponding distance for the list of binding regions. Sometimes, a peak is within a gene but far from the gene's TSS. Setting the parameter output to "both" outputs both the genes with nearest TSS and the overlapping gene. The parameter maxgap sets the maximum gap to be considered as overlapping. The parameter multiple indicates whether multiple overlapping features should be returned for one peak.
>annotatedPeak = annotatePeakInBatch (myPeakList, AnnotationData = TSS.human.GRCh37, output="both", multiple = F, maxgap = 0)
The annotated peaks can be saved as an Excel file for biologists to view easily.
>write.table(as.data.frame(annotatedPeak), file = "annotatedPeakList.xls", sep = "\t", row.names = FALSE)
Plotting the distribution of the peaks relative to the TSS gives a birds-eye view of the peak distribution relative to the genomic features of interest.
>y = annotatedPeak$distancetoFeature [!is.na(annotatedPeak$distancetoFeature) &annotatedPeak$fromOverlappingOrNearest == "NearestStart"]
>hist(y, xlab = "Distance To Nearest TSS", main = "", breaks = 1000, xlim = c(min(y)-100, max(y) + 100))
Such a plot generated from the putative STAT1-binding regions identified in un-stimulated cells ([2
]) shows that the STAT1-binding sites are enriched in regions near TSSs (Figure ). A pie chart was also generated as follows to show the distribution of relative position of the peaks to the nearest gene (Figure ).
Figure 1 Distribution of STAT1-binding sites relative to nearest TSSs in human. The histogram was generated from the putative STAT1-binding regions within 20 kb around TSS identified in un-stimulated cells . The plot shows that STAT1-binding sites are enriched (more ...)
Figure 2 Pie chart of STAT1-binding sites relative to nearest gene in human. The pie chart was generated from the putative STAT1-binding regions identified in un-stimulated cells . The plot shows that STAT1-binding sites are evenly distributed along upstream, (more ...)
>temp = as.data.frame(annotatedPeak)
>pie(table(temp [as.character(temp$fromOverlappingOrNearest) == "Overlapping" | (as.character(temp$fromOverlappingOrNearest) == "NearestStart" & !temp$peak %in% temp[as.character(temp$fromOverlappingOrNearest) == "Overlapping",]$peak),]$insideFeature))
It is also possible to obtain the annotation on-line from Ensembl using the getAnnotation
function as follows prior to calling annotatePeakInBatch
. Please refer to the biomaRt
package documentation for a list of available biomarts
>mart = useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
>Annotation = getAnnotation(mart, featureType="TSS")
To annotate the peaks with other genomic features, it is necessary to change the featureType (e.g., "exon" to find the nearest exon, "miRNA" to find the nearest miRNA, "5utr" to find the nearest 5' utr, and "3utr" to find the nearest 3' utr). It is also possible to pass customized annotation data into the function annotatePeakInBatch. For example, the user may have a list of transcription factor binding sites from the literature, from a different biological replicate, from a different peak-calling algorithm or from a different protein functioning as transcription complex together with the protein in study, and is interested in determining the extent of the overlap to the list of peaks from his/her experiment. Prior to calling the function annotatePeakInBatch, it is necessary to represent both datasets as RangedData, where start is the start of the binding site, end is the end of the binding site, names is the name of the binding site, and space and strand are the chromosome name and strand, respectively, where the binding site is located.
>myexp = RangedData(IRanges(start = c(967654, 2010897, 2496704), end = c(967754, 2010997, 2496804), names = c("Site1", "Site2", "Site3")), space = c("1", "2", "3"))
>literature = RangedData(IRanges(start = c(967659, 2010898, 2496700, 3075866, 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names = c("t1", "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand = c(1, 1, -1,-1,1))
>annotatedPeak1 = annotatePeakInBatch(myexp, AnnotationData = literature, output="overlapping", multiple = F, maxgap = 0)
Peaks in text format from peak-calling algorithms can be easily imported to R as data frame then converted to RangedData. For binding sites represented in BED or GFF format, BED2RangedData or GFF2RangedData were provided for converting these data formats to RangedData.
Example 2: Determining the significance of the overlapping and visualizing the overlap as a Venn diagram among different datasets
The second example describes how to determine the significance of the overlap, visualize the overlap in a Venn diagram and obtain merged peaks from different datasets such as different biological replicates, different peak-calling algorithms or different proteins functioning as a transcription complex. Here we give examples using different biological replicates.
The first step is to load the ChIPpeakAnno
package and three example datasets as RangedData
that contains putative Ste12-binding regions identified in yeast from three biological replicates [31
In the next step, the function makeVennDiagram is called to generate a Venn diagram to visualize the overlap among the three replicates. In addition, pair-wise overlapping significance tests were performed with hypergeometric test and p-values were generated. The parameter NameOfPeaks indicates the names of the dataset for labeling the Venn diagram. The parameter maxgap indicates the maximum distance between two peak ranges for them to be considered overlapping. The parameter totalTest indicates how many potential peaks in total that is used in hypergeometric test.
>makeVennDiagram(RangedDataList(Peaks.Ste12.Replicate1, Peaks.Ste12.Replicate2, Peaks.Ste12.Replicate3), NameOfPeaks = c("Replicate1","Replicate2","Replicate3"), maxgap = 0, totalTest = 1580)
As a result, a Venn diagram was generated for visualizing the overlap among the above three replicates. The pair-wise overlap comparisons show that the peaks identified from the replicates overlap significantly (Figure , p-value < 0.0001). The same analysis was applied to the three biological replicates of Cse4 and the overlap between replicate 1 and 2 is significant at p-value < 0.01 while the other two overlapping is significant at p-value < 0.05 (Figure ).
Figure 3 Overlapping of putative Ste12-binding sites among three biological replicates in yeast. The Venn diagram was generated from the putative Ste12-binding sites of three biological replicates in yeast . Hypergeometric test shows that there is a significant (more ...)
Figure 4 Overlapping of putative Cse4-binding sites among three biological replicates in yeast. The Venn diagram was generated from the putative Cse4-binding sites of three biological replicates in yeast . Hypergeometric test shows that the overlapping between (more ...)
The peak ranges from replicates do not overlap perfectly. It is desirable to combine all the overlapping peaks across replicates into merged peaks that cover all the overlapping peaks from the replicates. Calling the function findOverlappingPeaks can generate the merged peaks. Besides the parameters mentioned previously, an additional required parameter multiple indicates whether to return merged peaks from multiple overlapping peaks.
>MergedPeaks = findOverlappingPeaks(findOverlappingPeaks(Peaks.Ste12.Replicate1, Peaks.Ste12.Replicate2, maxgap = 0, multiple = F, NameOfPeaks1 = "R1", NameOfPeaks2 = "R2")$MergedPeaks, Peaks.Ste12.Replicate3, maxgap = 0, multiple = F, NameOfPeaks1 = "Replicate1Repliate2", NameOfPeaks2 = "R3")$MergedPeaks
Next, nearest genes and distances between peak location and nearest TSS can be obtained by annotating the merged peaks with SGD1.01 using annotatePeakInBatch
function illustrated in example 1 (Figure &). The same analysis was performed with Cse4 binding-sites (Figure &). The un-equal variance t-test shows that the distribution of the distance of Ste12-binding sites to nearest TSSs (Figure , 264 ± 36 bases) is very different from that of Cse4-binding sites (Figure , 311 ± 160 bases) (p-value = 0.001). Ste12-binding sites are distributed more towards the upstream of the gene (Figure &) while Cse4-binding sites are distributed more inside and downstream of the gene (Figure &). This result is consistent with what has been observed previously [31
]. The annotated datasets are available in Additional file 1
and Additional file 2
Figure 5 Distribution of Ste12-binding sites relative to nearest TSSs in yeast. The histogram was generated from the putative Ste12-binding regions merged from three biological replicates identified in yeast . The plot shows that Ste12-binding sites are enriched (more ...)
Figure 6 Pie chart of Ste12-binding sites relative to nearest gene in yeast. The pie chart was generated from the putative Ste12-binding regions merged from three biological replicates identified in yeast  The plot shows that Ste12-binding sites are distributed (more ...)
Figure 7 Distribution of Cse4-binding sites relative to nearest TSSs in yeast. The histogram was generated from the putative Cse4-binding regions merged from three biological replicates identified in yeast . The plot shows that Cse4-binding sites are enriched (more ...)
Figure 8 Pie chart of Cse4-binding sites relative to nearest gene in yeast. The pie chart was generated from the putative Cse4-binding regions merged from three biological replicates identified in yeast . The plot shows that Cse4-binding sites are distributed (more ...)
Example 3: Obtaining the sequences around the binding sites for PCR amplification or motif discovery
The third example describes how to obtain the sequences surrounding binding sites (in this example, 100 bp of upstream and downstream sequence) for PCR amplification, cloning or motif discovery [3
The first step is to load the ChIPpeakAnno package and create an example peak dataset as RangedData. Next, the organism-specific BSgenome package is loaded followed by calling the function getAllPeakSequence. The function available.genomes shows a list of available organism-specific BSgenome data packages. In this example, E. coli data package is used due its light-weight.
>peaks = RangedData(IRanges(start = c(100, 500), end = c(300, 600), names = c("peak1", "peak2")), space = c("NC_008253", "NC_010468"))
>peaksWithSequences = getAllPeakSequence(peaks, upstream = 100, downstream = 100, genome = Ecoli)
To convert the sequences to a common FASTA file format, the following function is called.
>write2FASTA(peaksWithSequences, file="test.fa", width = 50)
Sequences for merged Ste12 binding sites were obtained from package BSgenome.Scerevisiae.UCSC.sacCer2
(Additional file 3
). Significant motifs (E-value < 0.0000001) were identified by running MEME [30
] with motif occurrence set as ZOOP, minimum width as 8, maximum width as 20 and other parameters as default (Figure ).
Figure 9 Motif of Ste12-binding sites in yeast. The motif was generated using MEME  and the sequences from the putative Ste12-binding regions merged from three biological replicates identified in yeast . The default parameters for MEME were chosen except (more ...)
Example 4: Obtaining enriched GO terms near the peaks
The fourth example describes how to obtain a list of enriched GO terms associated with adjacent genes using a hypergeometric test.
The first step is to load the TSS annotated peak, which is the result returned from calling the function annotatePeakInBatch
, and the organism-specific GO gene mapping package (e.g., org.Hs.eg.db
for the GO gene mapping for human; for other organisms, please refer to http://www.bioconductor.org/packages/release/data/annotation/
for additional org.xx.eg.db packages).
The next step is to call the function getEnrichedGO. The parameter maxP is the maximum p-value required to be considered to be significant, multiAdj indicates whether to apply multiple hypothesis testing adjustment, minGOterm is the minimum count in a genome for a GO term to be included, and multiAdjMethod is the multiple testing procedure to be applied (for details, see mt.rawp2adjp in the multtest package).
>enrichedGO <-getEnrichedGO (annotatedPeak [1:6,], orgAnn="org.Hs.eg.db", maxP = 0.01, multiAdj = TRUE, minGOterm = 10, multiAdjMethod="BH")
Where enrichedGO$bp contains a list of enriched GO biological process, enrichedGO$mf contains a list of enriched GO molecular functions and enrichedGO$cc contains a list of enriched GO cellular components.
Table shows a list of enriched GO terms for yeast transcription factor Ste12 [31
Enriched GO terms of Ste12-binding sites in yeast.