|Home | About | Journals | Submit | Contact Us | Français|
Understanding the transcriptional regulation of microRNAs (miRNAs) is extremely important for determining the specific roles they play in signaling cascades. However, precise identification of transcription factor binding sites (TFBSs) orchestrating the expressions of miRNAs remains a challenge. By combining accessible chromatin sequences of 12 cell types released by the ENCODE Project, we found that a significant fraction (~80%) of such integrated sequences, evolutionary conserved and in regions upstream of human miRNA genes that are independently transcribed, were preserved across cell types. Accordingly, we developed a computational method, Accessible and Conserved TFBSs Locater (ACTLocater), incorporating this chromatin feature and evolutionary conservation to identify the TFBSs associated with human miRNA genes. ACTLocater achieved high positive predictive values, as revealed by the experimental validation of FOXA1 predictions and by the comparison of its predictions of some other transcription factors (TFs) to empirical ChIP-seq data. Most notably, ACTLocater was widely applicable as indicated by the successful prediction of TF→miRNA interactions in cell types whose chromatin accessibility profiles were not incorporated. By applying ACTLocater to TFs with characterized binding specificities, we compiled a novel repository of putative TF→miRNA interactions and displayed it in ACTViewer, providing a promising foundation for future investigations to elucidate the regulatory mechanisms of miRNA transcription in humans.
Transcriptional regulation is mediated by cis-elements and the transcription factors (TFs) that bind to these elements (1). Precise identification of cis-elements or transcription factor binding sites (TFBSs) is fundamentally important to decipher the complex transcription regulatory networks. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) has become the most powerful tool to profile TFBSs (2,3). Cis-elements are, however, often active only in specific cell types or during certain development stages; therefore, a comprehensive catalog of all cis-elements would require a thorough investigation of various physiological conditions. Current applications of ChIP-seq are limited by the availability of ChIP-grade antibodies, and the reported binding sites cannot be determined at nucleotide-level resolution. Computational methods have long been sought as an alternative to time-consuming experimental studies; however, owing to the short, degenerate nature of TFBSs, predictions usually contained an overwhelming number of false positives (FPs), which has been termed the ‘futility theorem’ (4). To improve the accuracy of predictions, existing methods often focus on promoter regions of protein-coding genes and have employed other criteria, such as conservation, co-operation and co-regulation (4,5). Recently, owing to the rapid progress in measures of chromatin accessibility by DNase-seq (6,7) and FAIRE-seq (Formaldehyde Assisted Isolation of Regulatory Elements) (8,9), as well as the elucidation of histone modification profiles based on ChIP-seq (10,11), chromatin accessibility and histone modification profiles of human cell lines have been cataloged (12,13). The accumulation of these data has provided an unprecedented opportunity to improve TFBS predictions in humans and methods incorporating such information have been successfully developed to work in a cell-specific manner (14–16). However, the human body contains more than 200 unique cell types, which could be under various physiological and pathological conditions. It is unpractical to thoroughly profile the chromatin accessibility and histone modification of all types of human cells under all conditions, and thus the applicability of such methods is limited to cell types whose experimental data are available. Currently, the major computational problem to solve is to enhance the applicability of predictions.
While the methods for predicting TFBSs associated with protein-coding genes are fairly comprehensive, unfortunately, transcriptional regulation genomics of non-coding RNAs (ncRNAs), such as microRNAs (miRNAs), which have been found to collaborate with proteins in essential biological processes, have been much less investigated. Accurate maps of TFBSs associated with ncRNA genes will be essential for a comprehensive understanding of protein-coding and ncRNA genes coordinate regulation network. miRNAs are a class of small ncRNAs, which are expressed in a spatio-temporal manner and play key roles in diverse biological processes through targeting and suppressing the expression of protein genes (17–19). Extensive studies have been performed on miRNA gene identification, expression analysis and function. For example, more than 1000 human miRNA genes have been documented in miRBase (20); expression patterns have been profiled from about 200 major human organs and cell types (21) and more than 3000 confirmed target genes of human, mouse or rat miRNAs are included in miRWalk (22). Although previous studies have shown that miRNAs were under tight transcriptional regulation (23–27), only few investigations have been conducted on the transcriptional regulation of human miRNA genes. Currently, only 221 human TF→miRNA (miRNA regulation by TF) interactions have been reported in TransmiR (28). This certainly represents only a small fraction of the total number of regulatory networks, and large-scale investigations are needed to decode the full set of pathways governing the expression of human miRNAs. Human miRNAs are either located in the introns of protein-coding genes (intronic miRNAs) or between protein-coding genes (intergenic miRNAs). Intronic miRNAs are likely to be transcribed along with their host genes (29,30), and existing TFBS prediction techniques for protein-coding genes can be used for these systems. However, for intergenic miRNAs, which are independently transcribed, the distances between transcription initiation sites (TSSs) and miRNA-coding regions dramatically vary, ranging from a few hundred bases (31) to 30-kb upstream (32). Also, TFs might not almost exclusively bind at proximal promoters (12); it is likely that a sufficient number of distal TFBSs would locate in regions between miRNA promoters and their coding regions providing additional regulatory information. To fully explore the transcriptional regulation of human intergenic miRNAs, it is necessary to examine large genomic regions to locate all the possible TFBSs, and thus existing methods employing only simple pattern matching (33,34) would lack a reasonable accuracy. Even filtered out with significant conservation (such as Conserved TFBS track in UCSC human genome browser and its web interface, PuTmiR) (35,36), there would be still a large amount of FPs due to the presence of slowly evolving neutral regions (37). On the other hand, recent methods based on cell-specific experimental data (38) have improved in accuracy, but the scope of their applicability is limited. Prediction methods that are suitable for human intergenic miRNAs, particularly those with high accuracy and a wide applicable range, are still lacking.
Here, we showed that the conserved and accessible chromatin sequences integrated from 12 cell types, immediately upstream of human intergenic miRNAs found also in the mouse and rat (referred as HMR intergenic miRNAs), were highly preserved across cell types. Therefore, we developed ACTLocater (Accessible and Conserved TFBSs Locater) incorporating known chromatin features to identify TFBSs associated with HMR intergenic miRNAs. Applying ACTLocater to selected TFs showed that the positive predictive values (PPVs) of predictions were greatly improved compared to conventional methods based on sequence conservation alone. Although ACTLocater was based on information from a limited number of cell types, it successfully predicted TF→miRNA interactions in cells whose chromatin accessibility information was not incorporated, suggesting that ACTLocater could be applied to a wide range of cell types. By using ACTLocater for TFs with known binding specificities, we established a comprehensive human TF→miRNA interaction database, ACTViewer. The resultant maps provided a solid foundation for understanding the regulatory pathways underlying HMR intergenic miRNA expression.
All human miRNAs and genome annotations were obtained from the UCSC Genome Browser (35) hg18 assembly. We considered miRNAs that reside between the protein-coding genes as intergenic miRNAs and identified human intergenic miRNAs conserved in mice and rats by the miRviewer (39). We next adopted the TSS predictions of these miRNAs from previous studies (26,40,41) and manually calibrated them according to the full-length cDNA data, SwitchGear TSS predictions and ENCODE promoter-associated histone mark (H3K4me3) on nine cell lines track in the UCSC Genome Browser. If any full-length cDNAs overlapped with the miRNAs, the 5′-terminal region of the full-length cDNA was used as the TSS for the corresponding miRNA; or if SwitchGear TSS prediction in the miRNA upstream region coming along with H3K4me3 peaks present in multiple cell lines, the SwitchGear prediction was used; otherwise, TSSs for the corresponding miRNAs were directly adopted from previous studies. Same-strand miRNAs with common predicted TSSs were considered as one transcriptional unit.
We extracted the genomic coordinates of the upstream 40-kb flanking regions or truncated flanking regions of genes upstream of HMR intergenic miRNAs from human genome. For clustered miRNAs, the regions upstream of each individual sequence were merged. We collected a total of 14 DNase-seq and four FAIRE-seq datasets (as listed in Supplementary Table S2) from the Data Coordination Center (http://genome.ucsc.edu/ENCODE/) of the ENCODE Project (12,13) and merged them as chromatin accessibility reference. We then extracted Human/Mouse/Rat alignments within the chromatin accessibility reference and in the regions upstream of HMR intergenic miRNA units from 17-way genome-wide multiple alignments using Galaxy (42).
To calculate the signal-to-noise ratios of FOXA1 predictions by ACTLocater and the Conservation method, we shuffled (99 runs) HMR accessible alignments and Human/Mouse/Rat alignments to generate alignments with the same length, base composition and patterns of gaps and conservation (43), respectively. The signal was defined as the number of FOXA1-binding sites predicted with the true alignments, whereas the noise was defined as the number of sites predicted with the shuffled alignments.
To locate the TFBSs from the HMR accessibility alignments, we implemented two searching methods, CScanner and MScanner, based on the consensus model and the PSSM model (44).
We designed CScanner to search both strands of every human sequence for sequence windows, and its corresponding orthologous sequences in mouse and rat matched the consensus. We implemented MScanner to score sequences on both strands as potential TFBSs using the formula (44):
where i is the position within the site, is the relative frequency of base b in the genome and is the observed relative frequency of base b at that position (from the matrix).
For each species, scores were normalized to a 100-point scale. The average score was defined as the arithmetic mean of the species scores. Predictions were based on the thresholds of the species scores and the corresponding average score.
PSSMs and/or consensus sequences used for selected TFs predictions by ACTLocater were as follows: for FOXA1, M00131 from TRANSFAC (45), MA0148 from JASPAR (46) and consensus sequence TRTTKRYTY (47); for c-Myc, M00118, M00123 and M00615 from TRANSFAC, MA0059 and MA0147 from JASPAR, and consensus sequence CACGTG (48); for the E2F family, M00024, M00050 and M00516 from TRANSFAC, MA0024 from JASPAR and consensus sequence TTTSCGC (49); for MyoD, a PSSM derived from ChIP-seq data (50) (details in ‘Supplementary Methods’ section); for NRSF, M00256 from TRANSFAC, MA0138 from JASPAR and two PSSMs from previous studies (51,52); for Oct4, M00135, M00138, M00161, M00195 and M00210 from TRANSFAC, MA0142 from JASPAR and consensus sequence ATGCWAAT (53). The cutoff values for species scores and average score for MScanner were set to 80 and 85, respectively, for all the PSSMs except NRSF. Because this protein recognized a long motif, its species and average score cutoffs were set to 75 and 80, respectively.
We collected the binding peaks from the ENCODE Project (12,13) and previous genome-wide ChIP experiments (54–56) to evaluate the TFBS predictions. ChIP peaks of NRSF have been identified by MICSA (57) using the raw data from the ENCODE Project. For each TF, we merged all ChIP peaks. We next assessed all predictions for FOXA1, c-Myc, the E2F family and NRSF by comparing the TFBSs predicted in the human genome with the corresponding genome-wide ChIP datasets generated by using human cell lines. A prediction was considered as a true positive (TP) if the predicted site was inside a merged ChIP peak; otherwise, the prediction was considered as a FP. We calculated the PPV as TP/(TP + FP). A merged ChIP peak was successfully predicted (TP) if there was at least one predicted TFBS within it; otherwise, the peak was annotated as a false negative (FN). And we calculated the sensitivity as TP / (TP + FN).
We assessed the evidence supported MyoD-binding sites by comparing the TFBSs predicted in the mouse genome with ChIP-seq datasets obtained from mouse cells (50) (ChIP-seq peaks were identified as described in ‘Supplementary Methods’ section). And we assessed evidence supported Oct4-binding sites by comparing the TFBSs predicted in the human and mouse genomes with ChIP-seq datasets obtained from human and mouse embryonic stem cells (26), respectively.
We defined accessible chromatin regions of skeletal muscle cells by merging the DNase-seq peak regions of HSMM, HSMMtube and SKMC skeletal muscle cells and defined those of embryonic stem cells by merging the DNase-seq or FAIRE-seq peak regions of H1-hESC, H7-hESC and H9ES embryonic stem cells. All the DNase-seq and FAIRE-seq datasets were obtained from the Data Coordination Center of the ENCODE Project, as listed in Supplementary Table S2.
To construct the ACTViewer database, we collected PSSMs from TRANSFAC, JASPAR and UniPROBE. Then, we applied MScanner to the HMR accessible alignments for each PSSM. Thresholds for species and average scores were set to 80 and 85, respectively. ACTViewer database was developed as described previously (58).
MDA-MB231 cells were a gift from Prof. Yan Zhang (Sun Yat-sen University, Guangzhou). MCF-7 cells and HEK293T cells were obtained from the Shanghai Cellular Institute of Chinese Academy of Sciences (Shanghai, China). All cells were grown in Dulbecco’s modified Eagle’s medium (GIBCO) supplemented with 10% fetal bovine serum (GIBCO) and were cultured in a 37°C incubator with 5% CO2.
The FOXA1 cDNA lentiviral vector and control empty vector were purchased from Fulengen Corporation (Guangzhou, China). To produce the lentivirus, we transiently co-transfected FOXA1 cDNA lentiviral vector or empty vector with the ViraPower™ Lentiviral Packaging Mix (Invitrogen) into HEK293T cells using Lipofectamine 2000 reagent (Invitrogen). MDA-MB231 cells were infected and then selected with puromycin (Sigma–Aldrich).
We performed ChIP assays in MCF-7 cells by using a chromatin immunoprecipitation kit (Millipore) according to the manufacturer’s instructions. Protein–DNA complexes were precipitated with a control IgG or an anti-FOXA1 antibody (ab5089, Abcam). PCR primers are listed in Supplementary Table S5.
Proteins were extracted with RIPA lysis buffer. FOXA1 protein was revealed with a polyclonal antibody (C-20, Santa Cruz). Signals were detected by the Super Signal Western Pico chemiluminescent substrate (Pierce). Western blotting of GAPDH on the same membrane was used as a loading control.
miRNA microarrays were performed at the Beijing CapitalBio Corporation. Total RNA content extracted by TRIzol reagent (Invitrogen) from MDA-MB231 cells stably overexpressing FOXA1 or control cells was labeled with biotin. Labeled samples were hybridized to GeneChip®miRNA Array (V2.0). Raw data were normalized and analyzed using the miRNA QC tool software (Affymetrix). We considered the HMR intergenic miRNAs with a fold change >1.5 or <0.75 in the FOXA1 overexpressing MDA-MB231 cells as FOXA1-affected HMR intergenic miRNAs.
Total RNA content was reverse-transcribed to cDNA using the Primescript RT reagent kit (Takara). Real-time PCR was carried out using SYBR Premix ExTaq (Takara) according to the manufacturer’s instructions. The relative expression of FOXA1 was calculated using the comparative 2−ΔΔCt method and was normalized to GAPDH. The following primer sets were used: for FOXA1, 5′-GAAGATGGAAGGGCATGAAA-3′ (forward) and 5′-GCCTGAGTTCATGTTGCTGA-3′ (reverse); for GAPDH, 5′-CCATGGGGAAGGTGAAGGTC-3′ (forward) and 5′-GAAGGGGTCATTGATGGCAAC-3′ (reverse). To detect the miRNA expression levels, Bulge-Loop™miRNA qPCR primer sets and U6 primer sets were purchased from RiBoBio Corporation (Guangzhou, China). The relative expression levels of miRNAs were normalized to U6 snRNA. We performed all real-time RT–PCR experiments in triplicate.
As components of conserved regulatory systems (59), we reasoned that HMR intergenic miRNAs rely not only on the conservation of the miRNA sequences but also on the transcriptional regulation elements. Via sequence examination and homology searches, we identified 203 HMR intergenic miRNAs, which were grouped into 106 transcriptional units (Supplementary Table S1). We arbitrarily chose a 40-kb region upstream of miRNA as the TFBS search area, such space was sufficient as indicated by the fact that a significant fraction of TSSs of human intergenic miRNAs are within 10-kb upstream regions (60,61).
To explore the features of accessible chromatin sequences upstream of HMR intergenic miRNAs, we mapped chromatin accessibility information of 12 cell types (Supplementary Table S2) obtained from the ENCODE Project (12,13) to the search area. In humans, mice and rats, 63.7% of the accessible sequences were conserved, whereas only 43.0% of the inaccessible sequences were conserved, indicating that accessible regions are rich in functional elements. Conserved accessible sequences were found upstream of 101 HMR intergenic miRNAs (Figure 1A and Supplementary Dataset S1). We next employed a leave-one-out cross-validation method to evaluate the conserved accessible chromatin information across cell types. Each round, conserved accessible sequences of one cell type were left out, and the sequences of left-out cell type recovered from those of the remaining cell types were then examined. The percentages of sequences recovered ranged from 74.7% to 90.0% (Figure 1B). On average, for each examined cell type, 79.7% of the conserved accessible sequences could be recovered from the remaining cell types. Such high values were not due to the similarities across cell types (Supplementary Table S3). SK-N-SH_RA and SKMC, which did not have any similar cell types in our collection, still had high sequence recovery values (80.2% and 78.5%, respectively). These results indicated that a significant fraction of conserved accessible sequences integrated from diverse sources were preserved across cell types and also implied that such integrated sequences may provide valuable clues for other cells whose chromatin accessibility has not been investigated.
Based on these observations, we designed the ACTLocater algorithm to incorporate chromatin accessibility and evolutionary conservation profiles, as shown in Figure 1C. First, we constructed a reference for accessible chromatin in the human genome by merging the DNase-seq and FAIRE-seq peak regions of 12 cell types. Then, we extracted multiple alignments within the accessible reference regions (referred to as HMR accessible alignments) from the Human/Mouse/Rat genome alignments. Finally, we identified TFBSs in HMR accessible alignments by CScanner and/or MScanner (details in ‘Materials and Methods’ section).
Initially, the search area contained 9750 conserved sequence blocks (~1.45 Mb human DNA sequences). When restricted to the accessible chromatin reference regions, only 3217 blocks (~0.34 Mb human sequences) remained. As the search area was reduced by 76.6%, the prediction specificity was greatly improved. To illustrate, we made a simple prediction of the Twist→hsa-miR-10b interaction similarly to that in Ma et al. (24) and visualized it in the UCSC genome browser (Figure 1D). Within the 4-kb region upstream of hsa-miR-10b, we identified 12 E-boxes (CANNTG), of which 5 were conserved. By comparing the predictions against the accessible chromatin reference, only one E-box remained, which was the proven Twist-binding site (24), demonstrating the high efficiency of ACTLocater.
To assess the performance of its predictions, we applied ACTLocater to predict HMR intergenic miRNAs regulated by FOXA1. A total of 52 binding sites were predicted (Supplementary Table S4). In contrast, the Conservation method (applying ACTLocater without considering chromatin accessibility) predicted 223 sites, indicating that incorporating chromatin accessibility information can greatly reduce the number of predictions.
To evaluate the significance of the predicted FOXA1-binding sites, we repeated predictions with shuffled input alignments. As shown in Figure 2A, the predictions of ACTLocater had significantly higher signal-to-noise ratios than those of the Conservation method (Two-sided two-sample Student’s t-test, P = 2.2 × 10−19). We then used FOXA1-binding peaks from previous genome-wide ChIP studies (54–56) to evaluate the quality of these predictions. Of the binding sites predicted by the Conservation method, 13.9% (31/223) could be validated against the known set of binding sites. For ACTLocater, the number of predicted sites reduced to 52, 15 of which were known binding sites and the PPV increased to 28.8%. The sensitivities of the Conservation method and ACTLocater were 15.6% and 8.4%, respectively. ACTLocater demonstrated substantially higher specificity than the Conservation method, while reducing the sensitivity. We next conducted ChIP assays in MCF-7 cells to verify predictions, which were not validated by previous genome-wide ChIP studies. Representative examples of ChIP results are shown in Figure 2B, and full results can be found in Supplementary Figure S1. 81.1% (30/37) of these sites were confirmed in MCF-7 cells, and thus the overall PPV of ACTLocater predictions was 86.5% (45/52), emphasizing its specificity in predicting FOXA1-binding sites. Furthermore, these results also indicated that ACTLocater could discover novel sites which were missed by high-throughput experimental methods.
To examine the functional importance of the predicted sites, we employed miRNA arrays to monitor the miRNA expression changes in FOXA1 overexpressed MDA-MB231 cells (Supplementary Table S6). FOXA1 overexpression was validated by real-time RT–PCR (Figure 2C; two-sided two-sample Student’s t-test, P = 0.0048) and western blotting (Figure 2D). We found that 50.0% (20/40) of the HMR intergenic miRNA units predicted by ACTLocater and 46.3% (31/67) of those found by previous genome-wide ChIP studies (54–56) showed a change in expression. 90.0% (18/20) of the FOXA1-affected HMR intergenic miRNA units predicted by ACTLocater were also identified by the genome-wide ChIP studies, indicating that the miRNA units detected by ACTLocater correlate well with those found by previous genome-wide ChIP studies. However, 13 miRNA units detected by the genome-wide ChIP studies were not predicted by ACTLocater. A detailed examination of the genomic features of the corresponding ChIP peak regions showed that these sequences lacked a conserved FOXA1-binding motif (Figure 2E), which is necessary for ACTLocater prediction. Finally, we used a more sensitive method, real-time RT–PCR, to further investigate the effect of ectopic expression of FOXA1 on miRNAs. We randomly selected 22 miRNA units and chose one member miRNA for testing from each unit. 77.3% (17/22) of the randomly selected miRNAs showed significant changes in expression levels, while two negative-control miRNAs showed no significant changes (Figure 2F). Hsa-miR-130a, of which the predicted site was not experimental supported, was likely regulated by FOXA1 indirectly. It should be noted that some of the examined miRNAs have multiple copies in the human genome, and the detection of real-time RT–PCR cannot ensure that the expression changes of multi-locus miRNAs were from the locus with predicted FOXA1-binding sites. When considering the miRNAs associated with experimental supported FOXA1-binding sites and derived from a unique locus, there are still 73.3% (11/15) miRNAs showed significant changes. These results indicated that most of the predicted FOXA1-binding sites were functional. Hsa-miR-202, hsa-miR-129-3p and hsa-miR-24, members of three miRNA units uniquely predicted by ACTLocater, all showed a significant change in expression levels, suggesting that ACTLocater could predict novel functional FOXA1-binding sites missed by genome-wide experimental assays.
Distinct histone signatures have been found for proximal promoters and distal enhancers. Previous studies have shown that strong H3K4me1 and H3K4me3 signals were associated with promoters, whereas strong H3K4me1 and weak H3K4me3 signals were associated with enhancers (10,62). To examine whether ACTLocater could predict not only proximal but also distal FOXA1-binding sites, we collected ChIP-seq peaks of histone modifications from various cells released by the ENCODE Project (12,13) (datasets are listed in Supplementary Table S2) and used them to classify the FOXA1-binding sites. Of the 51 predicted binding sites with histone modifications available, 37 were classified as proximal TFBSs and 14 were classified as distal TFBSs (Figure 2G). These classifications were supported by the fact that proximal TFBSs were usually associated with the Pol II ChIP-seq peaks, but not the case for distal TFBSs (Two-sided Fisher’s exact test, P = 0.00044; Pol II ChIP-seq datasets were obtained from the ENCODE Project, as listed in Supplementary Table S2). Moreover, the distances of proximal TFBSs from the corresponding TSSs were significantly less than distances between distal TFBSs and their TSSs (Two-sided two-sample Student’s t-test, P = 0.0028). These data indicated that ACTLocater predicted both proximal and distal FOXA1-binding sites. Six sites were found to be located in genomic regions between TSSs and miRNAs, suggesting that these regions could contain additional functional regulatory elements.
To further assess its performance, we applied ACTLocater to identify the target miRNAs of c-Myc, the E2F family and NRSF (also known as REST). The program predicted 29 c-Myc, 25 E2F and eight NRSF-binding sites (Supplementary Table S7, S8 and S9), corresponding to 22 c-Myc→miRNA, 18 E2F→miRNA and seven NRSF→miRNA interactions, respectively. The PPVs of these predictions based on ACTLocater were all superior to that of the Conservation method, while the sensitivities of these two methods were comparable (Table 1). In previous studies (23,25), 10 HMR intergenic miRNA units have been identified as directly transcriptional targets of c-Myc in P493-6 cells. ACTLocater successfully predicted 50% (5/10) of these miRNA units. As the c-Myc binding was assayed in conserved regions without considering the presence of a binding motif in Chang et al. (25), we examined the binding regions (PCR regions in the previous study spanned 1 kb) of the missing units and found that c-Myc binds to these regions via non-canonical motifs, causing them to be overlooked by ACTLocater. Additionally, some c-Myc predictions not validated by ChIP-seq were supported by other studies. For example, H19 RNA, the precursor of hsa-miR-675 and hsa-miR-9-3 have been found to be direct targets of c-Myc (63,64). These results showed that ACTLocater can yield consistent high-specificity predictions.
To determine the scope of our method, we used ACTLocater to predict cell-specific MyoD→miRNA and Oct4→miRNA interactions that occur in skeletal muscle and embryonic stem cells, respectively. Meanwhile, we generated two predictions with the chromatin accessibility data of skeletal muscle and embryonic stem cells and used them as independent verifications of the ACTLocater predictions.
To test whether MyoD→miRNA interactions could be predicted by the chromatin accessibility data of non-muscle cell types, we removed the chromatin accessibility data of SKMC cell from our chromatin accessibility reference. A total of 22 MyoD-binding sites were predicted by ACTLocater (Figure 3A and Supplementary Table S10). Also, we made a similar prediction with the chromatin accessibility data of skeletal muscle cells and yielded 26 MyoD-binding sites (Figure 3A and Supplementary Table S11). To avoid the potential differences in FP predictions between two datasets, we only considered sites with evidence supported. As shown in Figure 3B, 71.4% (10/14) of evidence supported sites (50) made with the chromatin accessibility data of skeletal muscle cells could be successfully predicted with those of non-muscle cells.
We next performed a similar analysis between the two Oct4 predictions, which were made with chromatin accessibility data of non-stem cells (our chromatin accessibility reference) and embryonic stem cells, respectively. Both predictions reported 16 Oct4-binding sites (Figure 3C; Supplementary Table S12 and S13). As shown in Figure 3D, 77.8% (7/9) of evidence supported sites (26,65) made with the chromatin accessibility data of stem cells could be successfully predicted with those of non-stem cells.
Taken together, most cell specific TF→miRNA interactions were successfully predicted, indicating that ACTLocater could be applied to a wide range of cell types, such as cells whose chromatin accessibility profiles had not yet been characterized.
Given that ACTLocater achieved high PPVs and could be widely applied, we constructed a regulatory map composed of TFBSs associated with HMR intergenic miRNAs by applying ACTLocater to the PSSMs from TRANSFAC (45), JASPAR (46) and UniPROBE (66). We found that different names in these database entries sometimes corresponded to TF isoforms or even to the same TF, which poses a serious challenge to consistent TF nomenclature. For the present study, we separated the predictions with the names and accession numbers retained. We retrieved a total of 295 PSSMs from TRANSFAC and predicted 13 688 TFBSs, corresponding to 4085 TF→miRNA interactions. Additionally, we retrieved 129 PSSMs from JASPAR and predicted 13 170 TFBSs, corresponding to 3320 TF→miRNA interactions. Finally, we retrieved 389 PSSMs from UniPROBE and predicted 938 TFBSs, corresponding to 724 TF→miRNA interactions. Although false predictions could not be avoided, a large number of predictions were assumed to be true TFBSs according to the high precision of ACTLocater, indicating that a complex regulatory network governing the expression of HMR intergenic miRNAs remains to be decoded.
To provide an effective view of the TFBSs predicted by ACTLocater, we developed the ACTViewer database (ACTViewer is accessible at http://deepbase.sysu.edu.cn/ACTViewer/). We provided a genome browser and search forms in ACTViewer to facilitate data access. Predictions from ACTLocater could be viewed using the ACTViewer browser (Figure 4A). The browser displayed annotation tracks beneath genome coordinate positions, allowing rapid visual correlation of different types of genome annotations. Zooming and scrolling controls helped to narrow or broaden the displayed chromosomal range to focus on the exact region of interest. We displayed ACTLocater predictions as tracks according to the PSSMs data sources and added secondary links from entries within prediction tracks to lead to corresponding details from each entry (Figure 4B). With the ACTViewer browser, users can conveniently examine TFBSs that are clustered, overlapping or located in special genomic regions. To access the predictions made about a specific TF or miRNA, we also provided a list of search forms according to TF and/or miRNA. A query using a specific TF could retrieve the miRNAs predicted as its target genes, and using a specific miRNA could allow users to find the predicted TFBSs associated with the miRNA input (Figure 4C and D, respectively). Additionally, users could retrieve interactions from ACTViewer by specifying a TF and a miRNA simultaneously (Figure 4E). In conclusion, ACTViewer provided interfaces for integrating visualization and rapid retrieval of the ACTLocater predictions.
This work has presented a new computational method called ACTLocater that integrates evolutionary conservation and chromatin structure to predict TF→miRNA interactions with significantly improved PPVs. Most notably, ACTLocater uses chromatin accessibility data from a limited number of cell types for reference, but could also be applied to other cell types whose chromatin accessibility has not yet been characterized. Thus, the ACTLocater method and the ACTViewer resource have considerable potential to advance studies of the miRNA transcriptional regulation that underlies diverse human biological processes.
Previously, several studies have also attempted to predict the TFBSs associated with human miRNAs. For example, miRGen (33) has mapped all vertebrate TF matrices from TRANSFAC to the regions spanning 5-kb upstream to 1-kb downstream of the miRNA TSSs. N@TN@RIM (34) has predicted potential TFBSs in the promoter regions of human miRNAs based on PSSMs from JASPAR and the ‘CpG island’ signal. However, these methods are based on simple pattern matching and are thus prone to false predictions. Until this point, the comparative genomics approach, such as the TFBS conserved track in the UCSC genome browser (35), was the most efficient method to predict potential TFBSs associated with miRNAs. As shown by our results for several TFs, ACTLocater yields a significant improvement in PPV compared to the comparative genomics approach using the same parameters (improvement in PPV could also be observed by comparing selected TF predictions to the UCSC conserved TFBS track, and the benefit of incorporating chromatin structure features could be noticed by comparing these predictions of the UCSC conserved TFBS track before and after filtering by our chromatin accessibility reference; see Supplementary Figure S2). Owing to the accumulation of cell-specific experimental data, computational methods integrating this information have also been developed to identify human TFBSs associated with miRNAs or on a genome-wide scale. For example, a method based on sequence features, histone modifications, and DNase I hypersensitivity has been developed and applied to human CD4+ T cells (15). MITF-regulated miRNAs have also been successfully identified by combining nucleosome information and motif matching in human melanoma cells (38). Furthermore, CENTIPEDE (16) has been used in human lymphoblastoid cells by incorporating DNase I hypersensitivity. In contrast to these cell-specific methods, ACTLocater was based on a chromatin accessibility reference, which was derived from a panel of cell lines and was highly preserved across cell types, making ACTLocater widely applicable.
ChIP-seq has been considered the state-of-the-art experimental technique for profiling TFBSs including species-specific and non-canonical binding sites. Additionally, ChIP-seq has been able to avoid ambiguity for TFs that share similar binding preferences; however, ChIP-seq has to be carried out for only one TF using one set of conditions at a time. Moreover, because of the sequencing depth and performance of data analysis, some true binding sites are likely to be missed (67,68). ACTLocater, by contrast, can evaluate all TFs simultaneously, to provide nucleotide precision and to uncover novel binding sites missed by ChIP-seq assays. Because of these abilities, ACTLocater and ChIP-seq technology could be complementary tools to provide exhaustive information regarding cis-regulatory networks.
Although the performance of our method is encouraging, integrating more source data is likely to improve the ability of ACTLocater. Currently, chromatin accessibility datasets from only 12 cell types were used; we expect that by incorporating more chromatin accessibility datasets from other cell types released by the ENCODE Project (12,13), the coverage of all possible accessible and conserved regions will increase (improvement in the coverage of these regions by increasing cellular datasets can be seen in Supplementary Figure S3), and thereby improve the sensitivity of ACTLocater. Another limitation of ACTLocater is its reliance on binding motifs for the recognition of TFBSs. The human genome encodes about 1400 TFs with sequence-specific DNA-binding properties (69), and binding preferences were only available for a small proportion of these factors. Establishing the binding specificities of these TFs using techniques such as protein-binding arrays and high-throughput SELEX (70) will undoubtedly expand the predictive ability of ACTLocater.
Currently, ACTLocater focuses on the transcriptional regulation of intergenic miRNAs in humans, but modified algorithms based on the same principles could be applied for other genes or on a genome-wide scale as long as the required data are provided. Furthermore, the multiple genome alignments of Drosophila and Caenorhabditis species were already available (35), and the monENCODE Project continues to produce a large number of chromatin accessibility data of D. melanogaster (71) and C. elegans (72). Integrating these datasets, ACTLocater could also be expanded to these species.
Supplementary Data are available at NAR Online: Supplementary Tables 1–13, Supplementary Figures 1–3, Supplementary Methods, Supplementary Dataset 1, Supplementary Software 1 and Supplementary Reference .
The National Basic Research Program from the Ministry of Science and Technology of China [No. 2011CB811300]; the National Natural Science Foundation of China [No. 30830066, 81070589 and 30900820]. Funding for open access charge: the National Natural Science Foundation of China [No. 30830066].
Conflict of interest statement. None declared.
The authors thank Prof. Yan Zhang at Sun Yat-sen University for providing the MDA-MB231 cells and Dr Jason Carroll at Cancer Research UK for providing the FOXA1 ChIP-seq data. The authors thank Xiao-Hong Chen, Qiao-Juan Huang and Yi-Ling Chen for technical assistance.