High throughput DNA sequencing is rapidly changing the landscape of genomic research 
. Recent studies using ChIP-seq technology have revealed genome-wide transcription factor binding sites 
, the distribution of histone modifications across the genome 
, and RPol II binding sites and patterns associated with active transcription of coding genes 
. In this study, we used ChIP-seq-derived RPol II binding data to identify regulatory regions of microRNAs, an important step toward understanding the cis
-acting element and trans
-acting factors that control the microRNA expression levels.
We hypothesized that RPol II binding distribution around the TSS is similar for microRNAs and protein coding genes. To test this assumption, we designed a statistical model to characterize RPol II binding patterns using the signals associated with highly expressed coding genes. Briefly, the RPol II ChIP-seq data was used to determine 10 parameters Φ that describe 5 Gamma distributions, from which the 5 parameters S, B, T, Kp, and Kt of every expressed coding genes are selected. These 5 parameters determine a Poisson parameter (λij) associated with the distribution of the number of RPol II binding fragments in bin j of gene i. Rather than being fitted for every expressed gene, these 5 parameters were treated as hidden variables and bounded by five Gamma distributions; this effectively characterized their between gene variations.
To predict the genomic loci of microRNA transcription start sites, we applied the model on the RPol II binding patterns in the upstream region of all annotated microRNAs. We further used this model to investigate the transcription of microRNAs in response to hormone treatment of two breast cancer cell lines, estrogen-dependent breast cancer cells (MCF7) and the anti-estrogen (tamoxifen) resistant subline (MCF7-T). Our model identified TSS for 72 microRNAs in at least one of four conditions (treatment of MCF7 or MCF7-T with either vehicle or 17β-estradiol). Our results suggest that microRNA predisposition can contribute to the development of antiestrogen resistance in hormone-dependent breast cancer cells. It should be noted that while comparing the predictions between two conditions, we did not take the RPol II binding intensity into account; only two states, “active” and “inactive” promoters, were considered. This is to avoid the potential bias caused by the conditional differences between samples, such as sequencing depth, library preparation errors, and so on. It is possible that for certain active promoters, RPol II binding intensity changes but the signal in both conditions are higher comparing to the background (active in both conditions). Our model cannot distinguish such differences. In addition, RPol II enrichment at the promoter region does not guarantee the expression of downstream gene; many mechanisms can contribute to such deviation, such as RPol II stalling, RNA binding protein-induced post-transcriptional regulation, and so on.
Promoter regions and TSS of non-coding RNAs have recently been identified using strategies based on three types of information: 1) sequence composition upstream of the microRNA, such as GC content, level of conservation, transcription factor binding sites and expressed sequence tags 
; 2) the distribution of epigenetic marks that encode regions of transcriptional initiation 
, or 3) ChIP-chip-derived RPol II binding data using custom tiling arrays designed to target ~50kb upstream the microRNA genes 
. Our approach differs from those studies in several ways. First, we did not use sequence composition as the model base for promoter prediction; instead, that type of information is used, in part, for model evaluation. We found that ~80% of the identified promoter regions overlap with at least one CpG island. In addition, the regions we identified tend to be more evolutionarily conserved. In contrast to sequence information, RPol II binding patterns provide important temporal and spatial measurements regarding the initiation of transcription, important for understanding the mechanism of microRNA transcriptional regulation. Second, our strategy differs from previous efforts using H3K4Me3 marks for successfully identifying microRNA promoter regions 
. H3K4Me3 highly localizes to promoters 
and therefore serves as an excellent transcriptional initiation mark. Therefore, we applied our model to one of the datasets containing both H3K4Me3 and RPol II binding data (
; from a published study measuring the binding patterns of 20 histone modification markers in human CD4+ T-cells). A detailed comparison between the two strategies revealed several interesting features (Appendix S1
), but perhaps most important was that H3K4Me3 maintains a permissive chromatin state that allows for transcription factor binding. However, the permissive chromatin state appears to be necessary, but not sufficient, for transcriptional initiation, as only 23% H3K4Me3-predicted microRNA promoters are recovered by our RPol II strategy (Figure S2
). This observation, however, can in part be caused by the differences of experimental conditions, such as sequencing depth. Third, our approach differs from a recent study attempting to identify TSS-containing regions in pri-microRNAs using RPol II ChIP-chip data from a tiling array platform targeting microRNA upstream regions of up to 50KB. Instead of only examining the microRNA upstream RPol II signals, we first trained our model using the RPol II binding patterns around the TSS of protein coding genes, providing a statistical framework for evaluating the sensitivity and specificity of the model prediction (). In addition, this framework allows for self-correcting of variable RPol II binding signals from different experiments, due to parameter identification for individual samples, making it possible to compare microRNA promoter signals under different biological conditions.
Despites these advantages, RPol II binding patterns around the TSS can only be used to identify regulatory regions of intergenic microRNAs, which account for approximately half of all microRNAs. Current evidence is lacking as to whether intronic microRNAs use their own TSS and promoter sequences or share the same regulatory components with the host gene. Our results suggest that most of the intronic microRNAs share promoter regions with their host genes, with a few exceptions. Similarly, our TSS search focus on 10kb upstream of microRNA annotation. Recent studies suggest that some microRNA promoters are far away from their mature product on the genome; they will not be predicted by the current strategy. Technically, increasing the searching scope is possible; however, the prediction accuracy will be decreased due to the interference with the RPol II signals of surrounding genes. It should also be noted that the model presented here only focuses on the transcriptional regulation in the microRNA biogenesis process; the microRNA expression can also be affected by other steps, including Drosha-involved nuclear processing 
, nuclear export 
, and Dicer-involved cytomastic processing 
. In addition, the computational model proposed here cannot be used to identify regulatory regions of the small percentage of microRNAs transcribed by RNA polymerase III 
As shown in Eq. 2, the current model did not incorporate the potential correlation among 5 parameters that characterize genome-wide RPol II binding patterns around active promoters. Neglecting such correlations will potentially affect the likelihood estimation, and therefore result in less than optimal promoter prediction. However, ROC curve on our current model suggested that the AUC has reached ~0.9 in predicting promoter regions of highly expressed genes (). Hence, additional improvement with better model won't be significantly beneficial. In order to model the correlations among S, B, and T, at least two more random effects need to be introduced into the model to characterize their shared variations. This additional level of hierarchical model will lead to one more layer of integration in the E-step. The numerical integration scheme will be very different, and computational expense will be much higher. Its complexity will exceed the current scope of this paper, and it is a challenging research question.
Our model differs from regular “peak finder” algorithms that are often used to identify binding sites of transcription factors derived from ChIP-seq experiments. An underlying assumption of regular peak finder algorithms is that DNA-binding proteins, such as transcription factors, contain sequence-specific DNA binding domains that target a cluster of cis
-acting DNA elements sharing certain sequence features. While such algorithms can identify DNA binding sites for highly specific transcription factors, they are not appropriate for identifying binding sites for the general transcriptional machinery, such as RPol II, which typically does not display high sequence specificity. In addition, as RPol II activity likely extends beyond the promoter/transcription start site of active genes, algorithms for assessing long-range RPol II binding are needed. Our data demonstrated that RPol II binding pattern around the gene transcription start site follows distinct patterns (), and our model is designed to jointly describe the number of RPol II binding fragments surrounding the TSS, including both promoter and transcript regions; this allows for a more accurate description of RPol II binding pattern features. Finally, the model framework described here can also be used to study the activities of other RPol II-related transcriptional events, such as tissue/condition-specific alternative promoter usage 
, bi-directional promoters 
, and regulatory regions of other RPol II-transcribed non-coding RNA in normal and disease states.