|Home | About | Journals | Submit | Contact Us | Français|
While it has been established that microRNAs (miRNAs) play key roles throughout development and are dysregulated in many human pathologies, the specific processes and pathways regulated by individual miRNAs are mostly unknown. Here, we use computational target predictions in order to automatically infer the processes affected by human miRNAs. Our approach improves upon standard statistical tools by addressing specific characteristics of miRNA regulation. Our analysis is based on a novel compendium of experimentally verified miRNA-pathway and miRNA-process associations that we constructed, which can be a useful resource by itself. Our method also predicts novel miRNA-regulated pathways, refines the annotation of miRNAs for which only crude functions are known, and assigns differential functions to miRNAs with closely related sequences. Applying our approach to groups of co-expressed genes allows us to identify miRNAs and genomic miRNA clusters with functional importance in specific stages of early human development. A full list of the predicted mRNA functions is available at http://acgt.cs.tau.ac.il/fame/.
MicroRNAs (miRNAs) are small (19–25nt), non-coding RNAs that can reduce the abundance and translational efficiency of mRNAs, and play a major role in regulatory networks, influencing diverse biological phenomena (1). In metazoans, this repression is generally conferred by the binding of miRNAs to the 3′ UTRs of their targets. Some miRNAs were shown to repress translation of anywhere from tens to hundreds of mRNAs (2,3). Several miRNAs have been shown to affect multiple members of the same pathway (4–7). Determining the role of individual miRNAs in cellular regulatory processes poses a major challenge. The function of the vast majority of miRNAs is currently unknown, and even for relatively well studied miRNAs, only a handful of targets have been rigorously characterized. Knock-out studies in model organisms have had only limited success in delineating miRNA function, possibly because redundant miRNAs exist at saturating levels in wild-type cells, or because of compensatory effects in downstream signaling pathways (8).
Analyzing properties of miRNA targets is a promising approach to predicting miRNA function. A large number of algorithms for sequence-based prediction of miRNA targets have been described in the literature [reviewed in (9)]. As the number of validated targets is currently limited, methods for target-based inference of miRNA function must rely on these predictions. If the targets of a specific miRNA are enriched with genes annotated with some biological process or pathway, it is reasonable to infer that the miRNA is involved in the same process. This suggests the following simple algorithm for genome-wide inference of miRNA function: Predict that a miRNA is involved in every process/pathway for which the number of miRNA targets taking part in the process is statistically significant. Several studies used this approach. Gaidatzis et al. (10) applied a log-likelihood test to look for enrichment or depletion of targets of specific miRNAs in KEGG pathways. Similar algorithms using Gene ontology (GO), KEGG and BioCarta pathways were implemented in miRgator (11) and SigTerms (12), both of which evaluate statistical significance using a hypergeometric (HG) test.
The previously described methods for target-based miRNA function prediction have three limitations. First, none of them was systematically tested for its ability to recover known miRNA functions. Second, they treat equally all predicted miRNA targets. Several recent studies have shown that, in fact, numerous factors, such as the local context within the 3′ UTR and the relative distance from the stop codon, influence the efficacy of individual miRNA target sites (13,14). These studies also offered ‘context scores’ for ranking the predicted targets of each miRNA. Third, existing methods do not take into account the very uneven distribution of 3′ UTR lengths; for example, the 3′ UTRs of genes expressed in brain and neural systems are almost twice as long as those of other genes (Figure 1B), and proliferating cells express genes with relatively short 3′ UTRs (15). Accordingly, genes highly expressed in the neural lineage harbor more predicted miRNA target sites (Figure 1B). It is not surprising, therefore, that the pathway most commonly predicted by Gaidatzis et al. (10) was ‘axon guidance’; and that many seemingly unrelated miRNA families, such as mir-17, known to be primarily involved in cell-cycle regulation (6), and myeloid cell differentiation (16), were predicted by them to be related to neuronal pathways.
miRNAs frequently appear as co-localized clusters that are also co-expressed, and even transcribed as a single polycistron. Therefore, it is likely that co-localized miRNAs share similar functions, which may be revealed by a joint analysis of their targets. Xu and Wong (17) used the HG test followed by random resampling to look for over-representation of miRNA cluster targets in BioCarta pathways. Their analysis identified the mouse miR-183–96–182 cluster as a regulator of the insulin-signaling pathway. However, their method did not take into account how many of the miRNAs in the cluster regulate each pathway member.
Here we introduce FAME (functional assignment of miRNAs via enrichment), a new permutation-based statistical method that tests for over- or under-representation of miRNA targets in a designated set of target genes (Figure 1A). Unlike previous studies, FAME utilizes weights (confidence values) for miRNA-target pairs, accounts for the number of miRNAs regulating each target, and can be used for analysis of any group of miRNAs. Here, we focus on three main applications of FAME: direct inference of miRNA function using sets of genes sharing a common annotation, indirect inference of miRNA function using matched miRNA/mRNA expression data, and prediction of function for genomic clusters of miRNAs.
In order to compare our method to other methods and to test its ability to recover known miRNA functions, we assembled a compendium of 83 experimentally validated miRNA-function associations. We show that our method is superior to the currently used HG test. We describe the novel functions suggested by FAME for several miRNA families, and show how it can be used to refine miRNA function in cases where only a relatively general miRNA function is known. We focus in particular on two families with similar seed sequences, mir-17 and mir-106/302, and demonstrate that they are likely to have both shared and unique functions.
Analysis of enrichment or depletion of miRNA targets in a set of co-expressed genes is an indirect yet potent way of providing clues to miRNA function. Previous studies used it to identify significant impact of miRNAs on tissue-specific gene expression patterns (18). Motif finding in 3′ UTR sequences has also been used for this task (19). We use our method to identify 68 miRNA families and 27 genomic clusters regulating 21 gene co-expression clusters in diverse human stem cell lines. Clusters enriched with the targets of a specific miRNA tend to be anti-correlated with the miRNA expression, whereas clusters depleted of miRNA targets are co-expressed with it. Finally, we use FAME’s results to predict novel miRNA functions related to stem cell biology. We hypothesize that two miRNAs of unknown function, mir-499 and mir-544, play a pivotal role in early development.
An implementation of FAME with a graphical user interface is available as part of the Expander 5.0 microarray data analysis suite (60) (http://acgt.cs.tau.ac.il/expander). This implementation supports analysis of over- and under-representation of miRNA targets in gene sets from human, mouse, fly and worm. In addition, a full list of the GO ‘biological process’ terms and KEGG pathways predicted to be targeted by each human miRNA appear in http://acgt.cs.tau.ac.il/fame/. This website also allows the display of the targeted genes as part of KEGG pathway maps.
Human miRNA target predictions for conserved miRNA families were taken from the TargetScan 5.0 database (21). Following the suggestion in (20), conserved target sites were used when testing for enrichment of miRNA targets, and both conserved and non-conserved predicted target sites were used when testing for depletion. Lengths of 3′ UTR sequences were taken from the TargetScan website. Spurious enrichments were avoided by filtering out targets with similar 3′ UTRs. If two 3′ UTRs shared an identical subsequence of at least 75nt, and >80% of the miRNAs predicted to regulate them, only the gene with the longer 3′ UTR was retained in the target set. A total of 101 genes were filtered out.
The bipartite graph G=(M,T,E,W) was constructed as follows. The context scores for all miRNA-target sites in the database were ranked and normalized to the range (0,1). For each miRNA m and target t, for each target site reported in TargetScan, an edge (m,t) was added to E with a weight of 1+(k·a·b), where k is the relative rank of the highest ranking conserved target site of m in t, and a and b are two parameters. Using this weighting scheme, the edge weights attain a discrete values w1,…,wa, and the parameter b controls the relative contribution of the context scores to the enrichment/depletion significance. We used a=5 and b=3 throughout this study, but got similar results when these parameters were altered (Supplementary Figure S3). In particular these tests showed that using context-score weights (b>1) improved the performance of FAME (Supplementary Figure S3).
The random graphs are generated by performing, for each possible edge weight wi, a long sequence of independent edge shuffle operations (22), which preserves the number of edges with weight wi incident on each node. Throughout the study, the total number of random edge shuffles for each random graph was set to 5·|E|.
Human GO annotations were taken from the Entrez Gene database. GO annotations from the ‘biological process’ namespace with between 10 and 2000 predicted miRNA targets were used. In order to remove redundancy, we filtered out terms that differed by less than four genes, retaining in the dataset only the GO set that was assigned to more genes. Applying these filters resulted in 1499 GO sets.
The miRNA expression data are described in (23). A total of 705 miRNAs were profiled using the Illumina human miRNAs version 1 microarray. A total of 21060 mRNAs were profiled using Illumina human WG-6 version 1 microarrays. Briefly, the data compare gene expression in 26 cell lines representing 16 cell types, including five embryonic stem cell (ESC) lines, five fetal neural stem cell (fNSC) lines, four adult surgery derived neural stem (aNSC) lines, one extraembryonic endoderm-like (XE) cell line differentiated from the WA09 hESC line, two glial cell lines, three fibroblast cell lines, two mesenchymal stem cell (MSC) lines, two umbilical cord vein (UCV) endothelial cell lines and two choriocarcinoma cell lines. For each cell line, the same RNA preparations were used to generate the mRNA and miRNA expression data. miRNAs were profiled in quadruplicate, and mRNAs were profiled in duplicate. All replicates were averaged prior to subsequent analysis.
We extracted the 4000 genes with the highest variance among the genes that had at least two samples with at least 3-fold difference over the minimal level across the profiles. These genes were clustered using CLICK (24), which resulted in 21 clusters. Assignment of genes to clusters is presented in Supplementary File S4.
Following (23,25), we defined a genomic cluster of miRNAs as a maximal segment such that every two consecutive miRNAs were separated by <50000bp. Genomic positions of miRNAs were taken from MiRBase (26). We considered only clusters that contained representatives from at least two different TargetScan families. Finally, we united any pair of clusters that contained exactly the same set of TargetScan families. This resulted in 27 clusters containing from 2 to 27 distinct TargetScan families, with an average of 3.3 families per cluster (Supplementary Table S1).
Our goal was to compute the significance of the overlap between a given set of predicted miRNA targets and a designated set of genes. Ideally, the computation should account for the strength of each predicted miRNA-target pair (or for our confidence in its biological relevance), and for the number of miRNAs regulating each gene (Figure 1A). FAME constructs a weighted directed bipartite graph G=(M,T,E,W) in which miRNAs (M) are connected with their predicted targets (T). An edge (m,t) appears in E for every target site for m that appears in the 3′ UTR of t (hence, parallel edges between the same pair are possible). We used TargetScan 5.0 for prediction of miRNA targets, as it was recently shown to be superior to other target predictors (2). miRNAs that belong to the same TargetScan family (a set of miRNAs sharing the same seed sequence) are grouped together into a single node in M. T contains a node for each gene (represented by an Entrez Gene entry) that is predicted to be targeted by at least one miRNA. Edges in E are assigned discrete edge weights based on the TargetScan context scores (‘Materials and Methods’ section). Spurious enrichments were avoided by excluding from T genes with similar 3′ UTR sequences (‘Materials and Methods’ section). As proposed in (18), we used only evolutionarily conserved miRNA target sites when testing for over-representation, and both conserved and non-conserved sites when testing for under-representation.
Following the construction of G, we used degree-preserving permutations to generate N random graphs G1,…,GN, in which, for each possible edge weight w, the number of outgoing edges with weight w from each miRNA and the number of incoming edges with weight w for each target were the same as in G (‘Materials and Methods’ section). We used N=10000 throughout this study. These graphs were used for evaluating the significance of the overlap between the targets of a set M′ of miRNAs (in this study, a TargetScan family or a set of families represented in a genomic miRNA cluster) and a set of targets T′ (e.g.: a set of genes sharing a GO annotation). Let WG(M′,T′) be the total weight of the edges connecting M′ and T′ in G. We compared WG(M′,T′) with all WGi(M′,T′), and computed an empirical P-value and a z-score for (M′,T′). All the (M′,T′) pairs were then ranked by their P-values, and FDR was assessed by the Benjamini–Hochberg procedure (27).
Rigorous evaluation of any prediction algorithm requires a ‘gold standard’: in our case a set of miRNAs with known functions. As we know of no available resource describing validated miRNA functions, we carried out an extensive literature survey and constructed a compendium of miRNAs with experimentally established functions in mammals. We included in the compendium only cases in which at least one target relevant to the pathway or function was experimentally validated (i.e. functions suggested based solely on phenotypes resulting from the perturbation of the miRNA were not included). In each case, we manually assigned the KEGG pathway and GO annotation that was closest to the reported function. The compendium, with references to the original publications, appears in Supplementary File S1. It contains a total of 31 miRNA–KEGG pathway associations and 52 miRNA–GO set associations.
Enrichment or depletion of miRNA targets in a set of genes involved in a specific process or pathway is the most direct clue to miRNA function prediction. We used two data sets to test this approach, one based on pathways taken from KEGG, and one based on GO. Each TargetScan miRNA family m was tested for over-representation of its targets in each KEGG pathway or GO annotation set that contained at least three targets of m.
We first describe the results on KEGG pathways. Using the compendium, we compared FAME with the HG test, and with the log-likelihood ratio (LLR) scores used by Gaidatzis et al. (10). For each miRNA m associated with a KEGG pathway P in the compendium, we ranked all 140 tested KEGG pathways according to the significance of their enrichment with the targets of m (Figure 2A). The success of each method in predicting a specific function was measured by the rank of P in this list. Eighteen compendium miRNA–pathway pairs met the criterion of at least three genes in P being predicted targets of m, and they were ranked by each of the three methods. In six cases the known pathway corresponded to the top FAME prediction, compared to just four cases when the HG test was applied, and three cases when the LLR test was used (Figure 2A). The average position of the known function across all the 18 pairs was higher for FAME than for the HG and LLR tests (Figure 2B), although the difference was not statistically significant, perhaps due to the small size of the compendium. Performance of the HG test was similar when only the top 25, 50 or 75% of the miRNA–target pairs (as determined by the context score) were used (Supplementary Figure S1A), and it never placed more than four correct pathways as top predictions (results not shown).
The top KEGG pathway predictions for each miRNA family are shown in Table 1. This analysis allowed us to predict novel functions for a number of miRNAs:
While the results with KEGG were promising, the specificity of KEGG pathways is rather limited, and some biological processes, such as development, are poorly represented. A much more comprehensive repository of gene sets in human is GO (http://www.geneontology.org). Since sets of genes sharing a GO annotation (henceforth referred to as GO sets) frequently overlap, and can be very general or very specific, we focused on 1499 non-redundant GO sets, containing between 10 and 2000 genes (Supplementary File S2, see ‘Supplementary Methods’ for details). The predictions of GO annotations for miRNA families appear in Supplementary File S3. For 36 compendium miRNA–GO set pairs, the GO set contained at least three predicted miRNA targets, and these pairs where used further (Supplementary File S1). The average ranking of the known miRNA–GO set pairs was higher when using FAME than when using the HG or the LLR tests (Figure 2C, Supplementary Figure S2). When we considered only pairs for which the known function was ranked in the top 10%: FAME significantly outperformed the HG test and the LLR test (P=0.015 and 0.002 respectively, Figure 2D). Once again, performance of the HG test was not altered by using only the top 25, 50 or 75% of the predictions (Supplementary Figure S1B–C).
Since our compendium consisted of relatively broad and non-specific functions, representing the current limited knowledge of miRNA functions, its precision for the evaluation of the performance of FAME is limited. It is possible that related, but more specific, functional terms that correspond to the real function of the miRNA were ranked higher than the compendium functions. Indeed, manual inspection of the results suggested several such cases (Table 2). For example, mir-146 was shown to be involved in the innate immune response (33,34), and therefore labeled ‘immune response’ in our compendium. However, only two genes annotated in GO with ‘immune response’ are predicted by TargetScan to be regulated by mir-146. One of the top FAME predictions for mir-146 is ‘I-κB kinase/NF-κB cascade’, a pathway that contains three mir-146 targets (CARD10,IRAK1 and TRAF6). Indeed, mir-146 was shown to affect the activity of the NF-κB pathway (35). Interestingly, the expression of mir-146 was also shown to be regulated by NF-κB (34), suggesting that mir-146’s function in immune response is regulated by a feedback loop. In another example, mir-205 was shown to regulate epithelial-to-mesenchymal transition (EMT) by targeting the transcription factors ZEB1 and ZEB2 (36). FAME predicts that mir-205 regulates ‘establishment or maintenance of cell polarity’ (ranked 12th). Loss of apical-basal polarity is one of the key steps in EMT (37). Notably, ZEB1 and ZEB2 are not annotated with this term in GO, despite the fact that several polarity-related genes, such as CRB3, PATJ and LGL2, are known ZEB1 targets (38). FAME prediction thus suggests that mir-205 regulates EMT mainly through regulation of apical-basal polarity genes, both by direct repression and via ZEB1 and ZEB2.
miRNA genes tend to appear in multiple copies in the genome, and can be grouped into families sharing similar mature sequences. According to the currently accepted model, the ‘seed sequence’, nucleotides 2–8 of the mature miRNA sequence, is the main determinant of miRNA targeting specificity (2,21,39). Several miRNA families share similar, but not identical, seed sequences. We previously observed that at least 18 different miRNAs that have the AAGUGC hexamer in their seed sequence are highly expressed in ESCs, (23). These miRNAs belong to two TargetScan 5.0 families: mir-17-5p/20/93.mr/106/519.d (henceforth referred to as mir-17, seed sequence AAAGUGC), and mir-106/302 (seed sequence AAGUGCU). Since TargetScan predictions are based on the conservation of seed matches, and these seeds overlap, TargetScan predicted numerous common targets for both families. However, the question of whether these two groups have entirely identical functions has not been resolved. Several members of both families have been studied and some evidence exists on their function. Members of both the mir-17 and mir-302 families were found to regulate the G1/S cell-cycle checkpoint (6,40–45) and the TGFβ-signaling pathway (46). Deletion of mir-17 in mice led to inhibited B cell development (16), and mir-17 was shown to control monocytopoiesis by targeting RUNX1 (47). Recently, the mir-302 family was shown to regulate the mesendodermal cellular fate specification, and repression of this family in human ESCs inhibited the formation of neuroectoderm during embryogenesis (48).
FAME predictions for the two families appear in Supplementary File S3. FAME predicted that both families regulate cell-cycle progression: ‘regulation of cell cycle’ was enriched in both target sets (P=0.5.0×10−4 for mir-17 and P=0.00465 for mir-106/302). mir-17 targets were more significantly enriched for ‘negative regulation of progression through cell cycle’ (P=4.5×10−4), concordant with the role mir-17 plays in accelerating progression through cell cycle (6). However, additional FAME predictions point to differences in the developmental functions of the two families. The targets of mir-17 but not mir-106/302 were enriched for ‘regulation of myeloid leukocyte differentiation’ (P=0.0799 versus P=0.503), with five mir-17 targets annotated with this GO term. In contrast, only the targets of mir-106/302 were enriched with ‘central nervous system neuron differentiation’ (P=0.011 versus P=0.25 for mir-17). These findings suggest that in addition to their common role in cell-cycle regulation, the two families have distinct roles in cell differentiation: mir-17 regulates development of leukocytes while mir-106/302 regulates development of the nervous system.
Identifying over- or under-representation of miRNA targets in a set of co-expressed genes is an indirect but powerful method for studying miRNA function (18). Here, we utilized this approach to study a collection of ~130 simultaneous miRNA and mRNA expression profiles from cell lines designed to identify regulatory pathways critical for self-renewal, pluripotency and differentiation of human stem cells (‘Materials and Methods’ section) (49). We refer to this collection as the stem cell data set (SCD). SCD contained expression profiles of ESCs, fNSCs, aNSCs, glia cells, fibroblasts, MSCs, umbilical vein endothelial cells, and two choriocarcinoma cell lines. We clustered the mRNA expression patterns in SCD using CLICK (24) and obtained 21 clusters of co-expressed genes (‘Materials and Methods’ section, Supplementary File S4). We tested each cluster and each miRNA for enrichment and depletion of miRNA targets using FAME and the HG test.
According to the currently accepted model, miRNAs function as repressors of gene expression, and thus their expression patterns are expected to be anti-correlated with those of their targets (50,51) and correlated with their anti-targets [genes depleted of miRNA target sites (18)]. However, it is often difficult to identify such effects in matched miRNA/mRNA expression datasets. For example, in the SCD, the average Pearson correlation between expression of miRNAs and their TargetScan targets is 0.009. The uneven 3′ UTR lengths of genes highly expressed in different stem cell types (Figure 1B) could be one of the reasons for this observation. In order to test this, we analyzed the results of FAME and the HG test using the miRNA expression data in SCD. When we detected over-representation of miRNA targets in a cluster, we tested whether the miRNA expression pattern and the average expression pattern of the mRNA cluster were significantly anti-correlated (Figure 2C). Similarly, we tested cases of miRNA target depletion for a significant positive correlation. In cases where the miRNA family contained more than one miRNA with expression data in SCD, we chose as a representative the miRNA that had the highest absolute value of expression correlation with the cluster. We found that the most significant enrichments identified by FAME were consistently better supported by the miRNA expression data (Figure 3): FAME yielded evidence of a significant positive correlation of miRNAs and sets of genes depleted of their targets in 23% of the cases, and evidence of a negative correlation of miRNAs and their targets in 18% of the cases. These results suggest that miRNA target depletion is more effective than enrichment in identifying functionally relevant miRNAs using co-expression data. Indeed, as described below, for several miRNAs with a known function in specific differentiation-related processes, we found evidence of depletion of target sites in genes expressed during the same developmental stage, but no evidence of enrichment of target sites in genes expressed at other stages.
In addition, we evaluated the correlation between the enrichment of miRNA targets in a cluster and the similarity of the expression patterns of the miRNA and the cluster. To this end, we used the P-values computed using FAME and the HG test to assign every miRNA–cluster pair with a relative rank of their enrichment P-value (highest ranks were assigned to pairs in which the targets of the miRNA were most significantly enriched in the cluster). Using FAME, we found a strong negative correlation between the enrichment rank and the similarity of the gene expression patterns (evaluated using Pearson correlation, r=−0.27). The same correlation was significantly weaker when using the HG-test (r=−0.065, P=0.026 for the difference between the two correlation coefficients). Similarly, the correlation between the significance of the depletion of miRNA targets and the similarity of gene expression profiles was significantly higher when using FAME compared to the HG test (r=0.037 versus r=−0.019, P=0.0377 for the difference between the two correlation coefficients).
By combining weak signals coming from individual miRNAs in a genomic cluster, one can uncover the function of the whole cluster (17). To test this concept we identified genomic clusters of miRNAs in the human genome (Supplementary Table S1; ‘Materials and Methods’ section) and repeated the analysis of the 21 SCD co-expression clusters using the genomic miRNA clusters.
Overall, at FDR<0.1, we identified enrichment or depletion of targets of 68 miRNA families and 27 genomic clusters in 21 co-expression clusters (Figure 4A). Of the 68 miRNA families, 16 are known to be related to stem cell biology [out of 25 stem cell-related families taken from (52), P=0.027], indicating that our cluster-based analysis is capable of revealing functionally relevant miRNAs. In comparison at FDR<0.1, the HG test reported significant enrichment or depletion for 77 miRNA families, but the overlap with the stem cell-related families was not significant (P=0.344).
Analysis with FAME revealed several known miRNA regulations that are supported by miRNA expression data: miR-9 and miR-124 targets are enriched in cluster 2, which is downregulated in fNSCs (and also in ESCs), and miR-9 targets are depleted in cluster 3 (Figure 4B). Targets of miR-17 family were enriched in clusters 7 and 8, which show low expression in ESCs, and depleted in clusters 1 and 13, which are upregulated in ESCs (Figure 4C). Targets of mir-106/302, which share the AAGUGC hexamer with mir-17 (see below), were also enriched in clusters 7 and 8, albeit less significantly (P=0.0603 and P=0.0056, respectively, FDR<0.1). Accordingly, miR-9 and miR-124 miRNAs were upregulated in fNSCs, and miR-17 and related miRNAs were strongly upregulated in ES cells [Figure 4B and C, (23)]. Interestingly, we identified more significant enrichment (compared to using individual miRNA families) with the four genomic clusters that contain members of the miR-17 and miR-106/320 families: gc:17-92a, gc:371-527, gc:106b-93 and gc:302a-367 (Figure 4C). These findings underscore the power of analyzing genomic clusters of miRNAs in addition to analyzing individual miRNAs.
miR-145 was recently shown to be an important regulator of differentiation by repressing the key ESC transcription factors OCT4, SOX2 and KLF4 (53). Increased miR-145 expression inhibited hESC self-renewal and induced lineage-restricted differentiation (53). In our data, the expression of miR-145 was strongly induced in differentiated cells, but also in two of the ESC lines (H1 and HSF6, Figure 4D), which may reflect their heterogeneity. FAME identified a significant depletion of miR-145 targets in cluster 8, which is upregulated in ESCs and in fNSCs (P=0.0091). In addition, the targets of the gc:143-145 cluster, which contains miR-145 along with miR-143, were even more significantly depleted in cluster 8, which contains genes downregulated in ESCs (and therefore upregulated following ESC differentiation, P=3.0×10−4). This suggests that mir-143 is also likely to be related to ESC differentiation. We found similar depletions for mir-499 and mir-544 families (P=0.0023 and P<1.0×10−4, respectively), both of which are also downregulated in ESCs (Figure 4D), suggesting that these families play an important role during ESC differentiation and early human development.
FAME identified significant depletion of the targets of let-7, mir-125 and genomic clusters containing these miRNAs, in cluster 3, which contains genes upregulated in fNSCs. Depletion of let-7 targets in brain-specific genes was reported earlier (18). Interestingly, members of the let-7 family were expressed at similar levels in the various non-ESC lines in our data set, but significant depletion of their targets was observed only in the cluster upregulated in fNSCs. We also identified significant enrichment of let-7 targets (but not of mir-125 targets) in cluster 5, which is upregulated in one of the choriocarcinoma lines (BEWO). Consistently, all members of the let-7 family were strongly downregulated in this cell line (Figure 4E). let-7 miRNAs are known tumor suppressors downregulated in various cancers (54). As members of the let-7 family appear in multiple genomic locations, our results suggest that their repression, either through coordinated events of transcriptional regulation or post-transcriptionally, leads to upregulation of their targets, which may contribute to malignancy in choriocarcinoma.
gc:134-758 is a large miRNA cluster of unknown function (55) located on chromosome 14, and is significantly downregulated in ESCs (23) (Figure 4F). We identified significant depletion of the targets of gc:134-758 in cluster 2, upregulated in differentiated cells, and in particular in aNCSs. The targets of two additional miRNA clusters, gc:181c-27a and gc:23b-27b, which were also downregulated in ESCs (Figure 4F), were also significantly depleted in this cluster. These results suggest that the main function of all three clusters takes place in differentiated cells and in aNSCs (rather than in fNSCs).
One of our goals in designing FAME was to overcome the bias introduced by the variability in 3′ UTR lengths. The average 3′ UTR lengths in the SCD varied greatly: the average 3′ UTR length of genes in cluster 2 (genes up regulated in fNSCs) was 2342, compared to just 1160 in cluster 2. In order to test whether FAME was able to alleviate this bias, we divided the 21 clusters into four bins based on average 3′ UTR length and compared the total number of significant enrichments (FDR<0.1) in each bin (Figure 5). The number of significant enrichments found with the HG test was correlated with UTR length (r=0.59); this correlation was significantly reduced with FAME (r=0.18).
We have presented FAME, a novel method for detecting enrichment or depletion of miRNA targets in sets of genes. This method has two main applications at the present time: direct inference of miRNA functions using sets of co-annotated genes, and prediction of miRNA-based regulation using mRNA expression data. To allow rigorous evaluation of FAME in the first task, we assembled a compendium of 83 miRNA–pathway and miRNA–process pairs. To the best of our knowledge, this is the first time the performance of such a method has been rigorously validated against experimentally tested functions. While it is still quite modest, this compendium can also be useful for evaluating future approaches to miRNA function prediction, and will improve as experimental evidence on miRNA function accumulates. Our compendium complements an existing database that lists the involvement of miRNAs in human disease (56).
The use of functional annotations of predicted targets for inference of miRNA function is a promising concept, but algorithms for this problem must cope with numerous obstacles, including the limited accuracy of miRNA target prediction methods, biases in 3′UTR length and composition, limitations in the existing systems for functional annotations, and the fact that most miRNAs have only a limited effect on the expression levels of their targets (2,3). A wealth of methods have been developed for cis-regulatory motif finding, typically applied to promoter sequences of co-expressed genes (19,57,58), but more recently also to 3′ UTRs (19,59). Some of these methods address key issues affecting transcription factor and miRNA binding, such as GC-content and distance from the transcription start site. However, they do not address key in miRNA target analysis, such as the number of miRNAs targeting each 3′ UTR and the influence of the 3′ UTR context around the miRNA target site. The results described here suggest that our statistical analysis is capable of overcoming some of these difficulties as it correctly infers the miRNA function in many cases. Such analysis can be improved further in the future by directly addressing differences in the composition of the 3′ UTRs.
Prediction of relatively low-resolution functions (e.g. on the level of KEGG pathways) seems to be easier than prediction of the precise biological process affected by each miRNA (represented by a GO term). However, assessment of the success of FAME on the latter task is limited by the currently crude knowledge of miRNA functions. Our analysis is also limited by the quality of the target prediction algorithms. The most successful predictors use information about target site conservation, but they miss targets (and therefore functions) that are less well conserved. In addition, it is currently difficult to efficiently predict functional miRNA target sites in coding sequence, even though a considerable fraction of them is estimated to regulate gene expression (2,3).
Data describing miRNA and mRNA expression in the same samples are now collected in multiple systems. We suggest the following three-step strategy for analysis of such data: (i) clustering of mRNA expression, (ii) detection of over- and under-representation of miRNA targets in clusters using FAME and (iii) analysis of the correlation between the expression patterns of the miRNAs and of the mRNA clusters in which they are implicated. As we show, this analysis recovers a significant number of miRNAs with key roles in the studied system. If the expression data describe a comparison between two biological conditions, differential expression analysis (e.g. using t-test) can substitute for the clustering of the mRNA patterns. Both types of analysis are made possible by our implementation of FAME as part of the Expander 5.0 microarray data analysis suite (http://acgt.cs.tau.ac.il/expander/). Using Expander, which contains pre-compiled TargetScan 5.0 target predictions, it is possible to load expression data, identify co-expressed or differentially expressed genes, and use FAME to detect over- or under-representation of miRNA targets [see the guidelines for using FAME in Expander in (60)]. Our implementation is quite efficient and analysis of 21 co-expression clusters with 1000 random iterations (including over 150 million network rewiring operations) takes 20min on a standard laptop PC with 2.6GHz processor and 2GB of RAM. We believe that this type of analysis will be of immense value for future joint mRNA/miRNA expression studies.
Supplementary Data are available at NAR Online.
Edmond J. Safra Bioinformatics program at Tel Aviv University, Legacy Heritage Fund and EMBO long-term fellowship (to I.U.); Israel Science Foundation (grant no. 802/08); European Community’s Seventh Framework Programme (TRIREME project, grant HEALTH-F4-2009-223575); Wolfson Family Charitable Trust. Funding for open access charge: Israel Science Foundation (grant no. 802/08); European Community’s Seventh Framework Programme (TRIREME project, grant HEALTH-F4-2009-223575); Wolfson Family Charitable Trust; NIH 5K12HD001259-10 Women's Reproductive Health Research Career Development Program (to L.C.L.).
Conflict of interest statement. None declared.
The authors thank Chaim Linhart for helpful discussions and comments on an early version of this manuscript.