Background: microRNAs (miRs) are short, single-stranded non-coding RNA molecules which function as post-transcriptional negative regulators of gene expression. miRs act by recognizing complementary target sites in the 3′-UTR of their target genes, and consequently inducing transcript decay or translational arrest of their targets (1). Complementarity is mediated mainly by nucleotides 2–8 of the 5′-end of the miR, frequently referred to as the ‘seed sequence’ (3). Each miR can regulate hundreds of genes, and >30% of the mRNAs transcribed from human genes are predicted to be regulated by miRs (4). During the past 10 years the number of identified miRs has expanded enormously, and they have been linked to numerous biological processes, including development, cell-cycle control, differentiation and apoptosis (5).
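The seed-matching rule described above can be illustrated with a minimal Python sketch. It assumes perfect Watson-Crick pairing of nucleotides 2–8 only, which is a simplification of what the prediction tools cited below actually score; the function names `reverse_complement` and `seed_sites` are hypothetical and chosen only for this illustration.

```python
def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mir, utr):
    """Return 0-based start positions in `utr` (written 5'->3') that are
    perfectly complementary to nucleotides 2-8 of `mir` (written 5'->3')."""
    seed = mir[1:8]                   # nucleotides 2-8 in 1-based numbering
    site = reverse_complement(seed)   # sequence expected in the 3'-UTR
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


# Illustrative input only: a let-7a-like miR and a made-up UTR fragment.
mir = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAAA"
print(seed_sites(mir, utr))
```

Because a 7-nucleotide match occurs frequently by chance in long 3′-UTRs, a scan of this kind returns many candidate sites, which is one reason purely sequence-based predictions need additional filtering.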
One of the major difficulties in miR research is to unravel the function of a miR of interest and the pathways it regulates. Since there is no simple and widely used high-throughput experimental method for miR target identification, the amount of available information regarding miRs’ function and their putative target genes is limited. A key route to inferring the function of a miR is through its target genes. Therefore, several computational algorithms have been developed in the last few years to address this problem [such as PITA (6), TargetScan (4), miRanda (7) etc.]. These algorithms are based on a sequence similarity score, conservation, and the overall stability and accessibility of the miR–mRNA duplex. However, the currently available sequence-based target prediction algorithms predict hundreds to a few thousand target genes for each miR, which makes it difficult to focus on a few likely targets of the miR of interest. Moreover, they are known to have high false-positive rates, and their predictions are not in agreement (8). A common procedure to overcome this problem is to intersect the results of several prediction algorithms in order to obtain a limited number of target genes for each miR, with fewer false positives. However, this procedure misses many bona fide targets, and hence although it has higher confidence it also has lower sensitivity (9–11). Although much effort has been invested in improving sequence-based predictions [for most recent work see (12–19)], so far no significant advance has reached consensus.
An obvious problem with sequence-based methods is their generality. These algorithms do not take biological context into account; for example, the top predicted targets of a certain miR might not be expressed at all in the specific tested model system. Thus, in spite of their high scoring by the sequence-based algorithm, they are not relevant to the specific model system (9). Our work was designed to address this issue of context-dependent miR target prediction. It is fairly clear that in order to predict the targets of a miR of interest accurately, with high sensitivity and specificity, the sequence-based predictions have to be integrated with other kinds of information. Since the problem is both unsolved and highly relevant, dozens of papers addressing the issue have been published in the last year. Several studies generated miR databases which contain sequence-based information along with lists of validated targets, expression data, signaling pathway resources and literature knowledge mining tools (20–28). A different approach was based on network analysis to identify signaling pathways associated with miRs (29).
Our approach is based on the belief that context-dependent functional targeting of a miR will be reflected in the expression data of its true mRNA targets (31). Therefore, we integrated another factor into miR target predictions: the correlation between the expression levels of the miR and the mRNAs. Here we propose an algorithm, Context Specific MicroRNA analysis (CoSMic), that combines experimental data from expression of mRNAs and miRs (measured in the same samples) with available sequence-based predictions. Combining these different kinds of information allows us to identify functional targets of miRs that play important roles in a specific experiment. As its output, CoSMic provides information about the statistical significance of the predictions, based on the enrichment of the high-scoring sequence-based target genes (4) in the group of genes whose expression is highly correlated with that of the miR (33). Hence CoSMic enables us to focus on the most significant candidate miRs for further investigation. Moreover, the number of targets predicted by the algorithm for each miR is only a few tens, which is a reasonable number for further experimental validation and investigation. Last, we provide experimental evidence for the efficiency of CoSMic in finding functional miRs and their putative functional targets in a particular system of interest: induction of motility in an EGF-stimulated human mammary cell line. Our algorithm predicts the putative target genes of a miR more accurately and with fewer false positives than all other algorithms we tested, and allows the identification of functional context-specific target genes.
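As a rough illustration of the expression-correlation ingredient described above, the following Python sketch computes the Pearson correlation between one miR profile and each mRNA profile measured in the same samples, and ranks the genes from most anti-correlated to most correlated. It is a sketch under stated assumptions, not the CoSMic implementation: the choice of Pearson correlation, the data layout and the names `pearson` and `rank_by_correlation` are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Returns 0.0 when either profile is constant (correlation undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(mir_profile, mrna_profiles):
    """Return (gene, r) pairs sorted from most negative to most positive r.
    `mrna_profiles` maps a gene name to its profile across the same samples."""
    scored = [(gene, pearson(mir_profile, profile))
              for gene, profile in mrna_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data: four samples, three genes.
mir = [1.0, 2.0, 3.0, 4.0]
genes = {"geneA": [4.0, 3.0, 2.0, 1.0],
         "geneB": [1.0, 2.0, 3.0, 4.0],
         "geneC": [1.0, 3.0, 2.0, 4.0]}
for gene, r in rank_by_correlation(mir, genes):
    print(gene, round(r, 3))
```

A ranking of this kind is the raw material that an enrichment step can then compare against the sequence-based target list.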
Brief review of recently developed related methods
Several methods combining sequence-based information with expression data have been developed in the past few years. Here we only briefly list the relevant methods (34–45); see Supplementary Data for a detailed description of each. In 2006, Sood et al. developed a computational tool, miReduce (34), which correlates 3′-UTR motifs with changes in mRNA levels, to improve the sensitivity of target predictions. A few years later Dongen et al. introduced Sylamer (35), a method for detecting targets of a miR from expression data by assessing over- or under-representation of its seed region in the 3′-UTRs of a gene list, ranked by an expression-based criterion. Next, two other algorithms which integrated gene expression into their predictions were published: Sigterms (36) and CORNA (37). Both use the set of differentially expressed genes from a specific experiment, and perform enrichment analysis to determine whether this set is enriched for targets of a particular miR (according to one of three sequence-based target prediction algorithms: TargetScan, PicTar or miRanda). In 2010, Ulitsky et al. introduced FAME (38), a permutation-based statistical method that tests for over- or under-representation of miR targets in a set of co-expressed genes. All these algorithms (miReduce, Sylamer, Sigterms, CORNA and FAME) utilize only mRNA expression data and do not take miR expression into account. The potentially important association between the miR and mRNA expression levels is not used, and hence they lose key information which provides statistical evidence for a regulatory relationship between the miR and its putative mRNA targets. Moreover, Sigterms and CORNA use only the list of differentially expressed genes and disregard the magnitude of change and the profiles of gene expression.
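The enrichment analysis mentioned above, which asks whether a gene set contains more predicted targets of a miR than expected by chance, is often carried out with a one-sided hypergeometric test. The Python sketch below shows that generic test; it is not the code of any of the tools listed, and the function name `hypergeom_enrichment_p` and its parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, n_targets, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    universe_size: number of genes considered in total
    n_targets:     genes predicted as targets of the miR
    n_selected:    genes in the set of interest (e.g. differentially expressed)
    n_overlap:     selected genes that are also predicted targets
    """
    upper = min(n_targets, n_selected)
    total = comb(universe_size, n_selected)
    # math.comb(n, k) returns 0 when k > n, so impossible terms vanish.
    tail = sum(comb(n_targets, k) * comb(universe_size - n_targets, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical numbers: 10 genes, 3 predicted targets, 3 selected, all 3 overlap.
print(hypergeom_enrichment_p(10, 3, 3, 3))
```

Note that this test sees only set membership: it uses neither the expression level of the miR nor the size of the expression change, which is the limitation pointed out in the paragraph above.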
The first algorithm that integrated both mRNA and miR expression data into the sequence-based prediction was GenMiR++ (39). GenMiR++ is a Bayesian model and learning algorithm, designed to explore functional miR targets. The algorithm outputs the posterior probability that a given miR targets a given mRNA under the GenMiR++ model. Two other algorithms that also exploit the full expression matrices of both the mRNA and miR data are MMIA (40) and MAGIA (41). In general, both algorithms intersect the group of predicted target genes of a specific miR (using one of the available sequence-based algorithms: TargetScan, PITA and PicTar) with the group of genes whose expression is inverse (MMIA) or anti-correlated (MAGIA) to that of the miR. Thus, both MMIA and MAGIA use sharp cutoffs for the statistical analyses and intersect the group of predicted target genes with the group of anti-correlated genes. This rigid approach might lose some putative targets, merely because the thresholds are set at some arbitrarily selected value. In addition, MMIA is suitable only for experiments with two conditions (e.g. control versus treatment), and therefore it is not applicable to datasets with more conditions (such as time-course experiments). During the last year several additional algorithms combining sequence-based target prediction with expression data were developed. Jayaswal et al. proposed a two-step method for the identification of miR–mRNA relationships; the first step is the identification of miR and mRNA clusters and the second step is the estimation of the association between the two types of clusters. Li et al. suggested a computational approach to construct association networks between miRs and mRNAs, using partial least squares (PLS) regression, without using any sequence-based prediction information. Lu et al. proposed a linear regression model to investigate one mRNA simultaneously regulated by multiple targeting miRs, with respect to their potential competition for binding sites. All these authors suggested approaches to improve target prediction, but they did not implement their methods and hence there is no tool readily available for biologists to explore their own experimental data. Moreover, the results of these algorithms were not validated experimentally. Another algorithm, developed by Bang-Berthelsen et al., is based on independent component analysis (ICA) that incorporates both seed matching and mRNA expression profiling. Bang-Berthelsen et al. do not consider miR expression data and hence, like CORNA and Sigterms, lose key information about the miR–mRNA regulation. In addition, no implementation is available for the biologist user.
We have made explicit comparisons of the predictive power of CoSMic with purely sequence-based predictions and with the five algorithms that also use expression data: GenMiR++, MAGIA, FAME, miReduce and Sylamer.
The added value provided by CoSMic over the other prediction algorithms that combine sequence-based information with expression data is summarized as follows:
First, CoSMic differs from the other methods in that it initially identifies miRs that play active roles in the specific biological system of interest, in addition to the identification of their functional and context-specific target genes. This feature is important when no prior knowledge is available about the miRs that play active and significant roles in the system of interest, and CoSMic may direct the biologist towards them.
Second, we provide experimental validation for CoSMic results, both for the identification of the significant miRs in a particular system (EGF-induced motility in a human breast cell line) and for their functional targets.
Third, the thresholds used by our algorithm are data driven; there is no sharp cutoff on the correlation or on the intersection of predicted target genes with correlated genes. Instead, we optimize a gene set enrichment procedure to get the group of correlated genes that are enriched in the sequence-based predictions. In addition, as opposed to the other prediction tools that combine sequence-based information with expression data, CoSMic takes into consideration in the enrichment analysis not only the identities of the genes identified as targets of a miR, but also their corresponding sequence-dependent scores.
Fourth, we implemented our algorithm as easy-to-use stand-alone software to allow biologists to apply it to the analysis of their data.
Last, our algorithm considers not only negative correlations, but also positive correlations as an indicator for miR–mRNA direct regulation.
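To make the idea of a data-driven threshold that also uses sequence-dependent scores more concrete, the sketch below walks down a correlation-ranked gene list and keeps a weighted running sum in the style of gene set enrichment analysis: predicted targets raise the sum in proportion to their score, non-targets lower it, and the cutoff is placed where the sum peaks. This is only an illustration of the general idea and should not be read as the procedure CoSMic implements. The function name `enrichment_cutoff` is hypothetical, and the sketch assumes that scores are non-negative with higher values meaning stronger predictions.

```python
def enrichment_cutoff(ranked_genes, target_scores):
    """Find the prefix of `ranked_genes` where weighted enrichment peaks.

    ranked_genes:  genes ordered by correlation with the miR (best first)
    target_scores: dict mapping predicted target genes to a non-negative
                   sequence-based score (assumption: higher is better)
    Returns (k, score): the leading edge is ranked_genes[:k].
    """
    hit_weight = sum(target_scores.get(g, 0.0) for g in ranked_genes)
    n_miss = sum(1 for g in ranked_genes if g not in target_scores)
    if hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one scored target and one non-target")

    running, best, best_k = 0.0, float("-inf"), 0
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in target_scores:
            running += target_scores[gene] / hit_weight   # weighted hit
        else:
            running -= 1.0 / n_miss                       # uniform penalty
        if running > best:
            best, best_k = running, k
    return best_k, best


# Hypothetical toy input: six ranked genes, three of them predicted targets.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = {"g1": 0.9, "g2": 0.6, "g5": 0.5}
k, es = enrichment_cutoff(ranked, scores)
print(ranked[:k], round(es, 3))
```

In this toy case the peak falls after the second gene, so the cutoff is determined by where the predicted targets concentrate in the ranking rather than by a preset correlation value. Running the same scan on a list ranked from the most positive correlation would address positively correlated candidates in the same way.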
Thus, we offer CoSMic and the corresponding experimental design (Figure 1) as a global strategy for unveiling the functional significance of miRs in a given biological system. The CoSMic algorithm is freely available at http://www.weizmann.ac.il/complex/compphys/software/cosmic/ (27 August 2012, date last accessed).
Figure 1. Flow chart of the experimental design. (A) Dataset of coupled mRNA and miR expression measurements from the same samples. Using predefined thresholds of expression across all samples and fold change, we filter genes and miRs that are expressed and changed ...