Scanning the entire genomic region for promoter patterns, we found 7,235 highly correlated regions, i.e. regions that show high similarity to any of the four patterns modeled by the double-exponential and uniform mixture models. Around 58% (4,167) of these matched regions overlapped with known promoter regions (1 kb upstream and downstream of the RefSeq TSSs). Although these regions represent only 22% of all known promoters, this is not surprising, since not all genes are expressed at the same time. Hence, these promoter patterns may represent the promoters that are currently active in the MCF7 breast cancer cell line. Indeed, as shown in Figure 4, genes whose promoters display these patterns have significantly higher expression values than genes whose promoters do not (Mann-Whitney test, p-value < ). Gene expression was determined using FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values derived from RNA-seq data on MCF7 using Cufflinks [14].
Figure 4 Genes with promoter patterns have higher expression values. Genes whose promoter regions display the Pol-II and H3K4me2 patterns have significantly higher expression values than genes whose promoter regions do not (Mann-Whitney test, p-value ...)
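The FPKM measure used above can be illustrated with its basic definition: fragments mapped to a transcript, divided by the transcript length in kilobases and by the number of mapped fragments in millions. The following is a minimal sketch of that definition only; Cufflinks estimates FPKM with a statistical model over isoforms, so this is not its implementation. The function name, parameter names and the numbers in the usage line are hypothetical.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Basic FPKM definition (illustrative; not the Cufflinks estimator).

    FPKM = fragments / (length in kb) / (total mapped fragments in millions)
         = fragments * 1e9 / (length in bp * total mapped fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2 kb transcript in a library
# of 20 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 20_000_000)
```

Normalizing by both transcript length and library size is what makes expression values comparable between genes of different lengths and between sequencing runs of different depths.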
Of the remaining 3,068 highly correlated regions, which cannot be mapped to known gene promoters, 1,104 fall inside known gene bodies. Some of them correspond to known isoforms. For example, the gene TANK on chromosome 2 is known to have isoforms. Interestingly, as shown in Figure 5, the transcription start site of its isoform coincides with the location where the promoter pattern is identified. The alternative promoter of the gene MAT2B also displays the promoter pattern (see Figure 5). This is evidence that the promoter pattern exists in alternative promoter regions. On the other hand, there are regions showing the promoter pattern that do not overlap with any known isoform. Some of these regions overlap with exons, which indicates that they are very likely to be unknown alternative promoters (see Figure 5).
Figure 5 Promoter patterns are present in the gene bodies. Exons (black bar) and transcriptional orientation (arrow) are indicated at the bottom of each panel. The location of the longest isoform is indicated at the top of each panel. (A) Promoter pattern exists ...
For the remaining 1,964 correlated regions, we examined whether they can be associated with any transcripts. Because the RNA-seq protocol does not yield data for small RNAs, we first checked for overlap between these correlated regions and the non-coding RNA tracks (i.e. snoRNA and miRNA) from the UCSC genome browser. Only 6 regions overlap with the location of a non-coding RNA in the human genome. One example is shown in Figure 6A. Next, we checked whether the remaining 1,958 regions overlap with human transcripts listed in the expressed sequence tag (EST) database (from the UCSC genome browser). Human ESTs are single-read sequences that usually represent fragments of transcribed genes. We found 1,330 regions that overlap with ESTs. An example is shown in Figure 6B. We also used RNA-seq data on MCF7 to find transcripts of new (undiscovered) genes. The RNA-seq data were processed using Cufflinks [14] to assemble transcripts. We found four regions that cannot be mapped to other transcripts but lie in the proximity of transcripts detected using RNA-seq data. An example is shown in Figure 6C. The image of the detected transcript was generated using the Integrative Genomics Viewer (IGV) [15]. An overlap with these transcripts is defined as any base-pair overlap between the 2-kb area surrounding the center of a correlated region and the interval from the start to the end location of a transcript. In total, 1,340 (68%) of the 1,964 regions that cannot be mapped to known promoters or their gene bodies overlap with transcripts annotated as non-coding RNAs, with ESTs, or with transcripts detected by RNA-seq. We annotate these 1,340 regions as predicted alternative promoters, since they overlap with some type of transcript, whether non-coding or predicted using RNA-seq data.
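The overlap criterion described above (any base-pair overlap between the 2-kb area around the center of a correlated region and a transcript's start-to-end interval) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that "2-kb area" means 1 kb on each side of the center and that coordinates are zero-based and half-open, as in BED files; both are assumptions, and all names and coordinates are hypothetical.

```python
def window_around_center(start: int, end: int, half_width: int = 1000) -> tuple[int, int]:
    """Return the window of 2 * half_width bp centered on a region.

    Assumption: the '2-kb area' is 1 kb on each side of the region center.
    """
    center = (start + end) // 2
    return max(0, center - half_width), center + half_width


def intervals_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def region_overlaps_transcript(region: tuple[int, int],
                               transcript: tuple[int, int],
                               half_width: int = 1000) -> bool:
    """Apply the overlap criterion to one region and one transcript
    that are already known to lie on the same chromosome."""
    win_start, win_end = window_around_center(*region, half_width=half_width)
    return intervals_overlap(win_start, win_end, *transcript)


# Hypothetical coordinates: the region center is 10,200, so the window
# is [9,200, 11,200).
hit = region_overlaps_transcript((10_000, 10_400), (11_100, 15_000))
miss = region_overlaps_transcript((10_000, 10_400), (11_200, 12_000))
```

With half-open coordinates, a transcript that starts exactly at the window end does not count as overlapping; under a closed-interval convention the comparison would use `<=` instead.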
Figure 6 Regions displaying promoter patterns that overlap with transcripts or other regulatory regions. (A) Region that overlaps with a non-coding RNA (hsa-mir-375) on chromosome 2. (B) Region that overlaps with 7 human ESTs. (C) Region that overlaps with detected ...
Recently, RNA polymerase II has been discovered at enhancer regions. These regions, which affect genes located far away, can produce their own RNA molecules. We therefore examined whether the same promoter pattern can be found at enhancer regions. We used the binding sites of ER (Estrogen Receptor) and AR (Androgen Receptor) as representatives of enhancer regions, since both of these proteins have been shown to bind at distal enhancer regions. By overlapping the unmapped regions with ER binding sites, we found 120 regions with similar promoter patterns. An example is shown in Figure . However, after the ER binding sites had been mapped, none of the remaining regions overlapped with AR binding sites.
Of the remaining 504 correlated regions, 73 can be further mapped to other regulatory regions, such as CpG islands and CCCTC-binding factor (CTCF) binding sites. We used the CpG island tracks downloaded from the UCSC genome browser to annotate CpG island locations. For CTCF, we used the CTCF binding sites that are present in three different cell lines (Jurkat, CD4 and HeLa), since it has been shown that these sites are conserved [16]. Examples of regions mapped to CpG islands and CTCF binding sites are shown in Figure and , respectively. Finally, we were left with 431 regions that display the promoter pattern but cannot be mapped to known genes, transcripts or any regulatory regions. An example is shown in Figure 7 (right panel). These unmapped regions may be potential new promoters or markers for other annotations, and they need further investigation. Figure 8 summarizes the overlaps, which were performed hierarchically from top to bottom. The number of regions that independently match each genome annotation is summarized in Table .
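The hierarchical, top-to-bottom annotation described above can be sketched as assigning each region to the first annotation level it overlaps. This is a minimal illustration under stated assumptions, not the authors' pipeline: the level labels are shorthand for the steps in the text, and the toy input is hypothetical. The last lines check that the counts reported in the text leave 431 unmapped regions.

```python
from collections import Counter

# Annotation levels in the order the text applies them (labels are shorthand).
HIERARCHY = [
    "known promoter",
    "gene body",
    "transcript",          # non-coding RNA, EST or RNA-seq transcript
    "ER binding site",
    "CpG island or CTCF",
]


def assign_hierarchically(region_hits, hierarchy=HIERARCHY):
    """Count regions by the first level in `hierarchy` that they overlap.

    `region_hits` maps a region id to the set of levels it overlaps.
    Regions that overlap no level are counted as "unmapped".
    """
    counts = Counter()
    for hits in region_hits.values():
        label = next((level for level in hierarchy if level in hits), "unmapped")
        counts[label] += 1
    return counts


# Hypothetical toy input: four regions with made-up overlaps.
toy_hits = {
    "region_1": {"known promoter", "transcript"},
    "region_2": {"transcript", "CpG island or CTCF"},
    "region_3": {"CpG island or CTCF"},
    "region_4": set(),
}
toy_counts = assign_hierarchically(toy_hits)

# Counts reported in the text for each level, in hierarchy order.
reported = [4167, 1104, 1340, 120, 73]
remaining = 7235 - sum(reported)  # regions left unmapped after all levels
```

Because each region is counted only at the first matching level, these hierarchical counts differ from the independent per-annotation counts, where a region can match several annotations at once.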
Figure 7 Examples of regions with Pol-II (top, red) and H3K4me2 (bottom, blue) patterns. Left panel: the predicted region overlaps with the RefSeq gene RPS6KB1. Transcriptional orientation (arrow) is indicated at the bottom. Right panel: a potential promoter ...
Figure 8 Summary of overlap of correlated regions with genome annotation, transcripts and other regulatory regions. Number of regions displaying promoter patterns that are found to be overlapping with genome annotation, transcripts or other regulatory regions. ...
Number of correlated regions that overlap with each genome annotation or transcripts including those that are detected using RNA-seq
We investigated the overlap of these correlated regions with more than one genome annotation (Figure ; image generated using Venny [17]). We found that almost all of the correlated regions that overlap with RNA transcripts also overlap with ESTs (99%, 2,703 out of 2,707). About 26% of the correlated regions map exclusively to ESTs, and only 3 map exclusively to the TSSs of RefSeq genes. About 6% (431) of the correlated regions do not overlap with known genes, transcripts or other regulatory regions; they may still represent potential novel promoters. For example, Figure 7 (left panel) shows a putative promoter region that overlaps with a known gene, RPS6KB1, on chromosome 17. The Pol-II and H3K4me2 patterns are very prominent around the TSS of this gene, with the combination of a unimodal Pol-II peak and a bimodal H3K4me2 peak. Figure 7 (right panel) shows a putative novel promoter region that does not overlap with any of the above genome annotations. Although the pattern on the right also displays a unimodal Pol-II peak and a bimodal H3K4me2 peak, just like the known promoter pattern on the left, it does not have tails in the transcribed region. As discussed earlier, this phenomenon could be due to Pol-II stalling [7].
Venn diagram of correlated regions that overlap with more than one genome annotation. Most of the regions that display promoter patterns overlap with RefSeq genes, ESTs and RNA-seq transcripts (2,182 regions).
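The quantities read off the Venn diagram (regions exclusive to one annotation, regions shared by all annotations, and the fraction of one set contained in another) reduce to set operations on region identifiers. The sketch below illustrates this on hypothetical toy data; the region ids and set contents are invented for illustration and are not the study's data, and Venny itself is not used here.

```python
def exclusive_members(sets_by_name):
    """For each named set, return the members found in no other set."""
    result = {}
    for name, members in sets_by_name.items():
        others = set().union(
            *(m for other, m in sets_by_name.items() if other != name)
        )
        result[name] = members - others
    return result


# Hypothetical toy data: region ids overlapping each annotation.
annotations = {
    "RefSeq": {"r1", "r2", "r3"},
    "EST": {"r2", "r3", "r4", "r5"},
    "RNA-seq": {"r2", "r3", "r4"},
}

exclusive = exclusive_members(annotations)

# Regions shared by all three annotations (the central Venn area).
shared_by_all = set.intersection(*annotations.values())

# Fraction of RNA-seq regions that also overlap ESTs.
rnaseq_in_est = (
    len(annotations["RNA-seq"] & annotations["EST"]) / len(annotations["RNA-seq"])
)
```

The same intersection-over-size calculation is the kind that yields a statement such as "2,703 out of 2,707" once the real region sets are supplied.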