While the ability to predict miRNA targets represented a great advancement in the field and has helped to identify many miRNA targets, the false positive rate of all of the algorithms is high. This hampers accurate prediction of miRNA function based solely on the list of predicted targets. While it is well established that the co-expression of miRNAs which are part of the same transcriptional unit is often indicative of a common biological function, it has not been consistently so, and also has not permitted accurate grouping of miRNAs according to their function. We now present a new method to group miRNAs, which relies on identification of co-expressed effector genes for which a large amount of functional data is available.
Our analysis is based on three premises: 1) Cancer-relevant miRNAs act in pathways that regulate cellular differentiation stages which can be recognized by the signature of expression of a number of stage-specific marker genes (i.e. E-cadherin
for epithelial cells or early carcinomas). 2) A substantial change in the expression of these marker genes is brought about by a relatively small change in the expression of genes that are actual miRNA targets (i.e. ZEB1/2
). MiRNAs therefore are part of amplification loops. 3) A cancer-relevant differentiation stage is characterized as much by expression of positively correlating marker genes (i.e. E-cadherin
) as by negatively correlating ones (i.e. Vimentin
) which are not direct targets of the correlating miRNA (i.e. miR-200
does not target Vimentin
). Our method is complementary to existing methods and, in conjunction with prediction algorithms and improved methods for confirmation of actual in vivo
targeting events 
, should allow better prediction of the cancer relevant function of miRNAs.
A number of studies have addressed the question of miRNA function using biocomputational approaches. These approaches have all used target predicting information based on sequence complementarities (e.g. TargetScan).
In one early work, a computational strategy to predict miRNA regulatory modules (MRMs) was developed using bipartite graphs to model miRNA-mRNA binding structures 
. An extension of the MRM concept was incorporated into a later computational method to discover functional miRNA-mRNA regulatory modules (FMRMs) 
. This method associated expression data and specific biological conditions (cancer versus normal tissue) with the previous bipartite graph structure. An algorithm of population-based probabilistic 
and an alternative method of rule-based learning 
were also employed. However, these approaches all relied on known target prediction algorithms and/or gene ontology (GO) data, and were either theoretical, or did not consider actual expression data, or were based on rules that were generated using over-expression systems.
Two approaches 
used Bayesian Network methods to explore interactions between miRNAs and mRNA transcripts. The first report developed a probabilistic graphical model for miRNA regulation and variational learning for detecting miRNA targets. The second report developed a strategy of splitting and averaging of Bayesian networks (SA-BNs). In this method, miRNA-mRNA interactions in the NCI60 cell lines were split into different sample categories, and both up- and down-regulation information was included. These improvements were believed to capture stronger interactions than were detected using conventional Bayesian Networks. However, this study only described a subset of miRNAs and genes regulating EMT. The analysis culminated in predictions on signaling networks regulated by miRNAs using the Ingenuity pathway analysis software, which is based on published data and requires constant updating. While a Bayesian Network provides a powerful method for constructing regulatory networks, it has its limitations in predicting miRNA regulatory modules. 1) According to the definition of a Bayesian Network, loop structure is not allowed, which means a Bayesian Network cannot be applied to regulatory feedback loops as they are common between miRNAs and their targets (e.g. the relation between ZEB1
and the miR-200
family). 2) The learning procedure of a Bayesian Network is computationally exhaustive 
, and the space of possible structures is limited by gene number, as linear increases in gene number cause exponential growth of the required space.
A recent approach was based on the statistical enrichment of miRNA targeting signatures in annotated gene sets. It allowed prediction of a number of novel targets for various miRNAs and prediction of disease relevant pathways 
. However, this analysis was again based solely on predicted functions of genes and predicted seed matches, and actual expression data were not considered. In addition, it used target prediction algorithms as well as GO data. The analysis was based on a number of assumptions and requires frequent updating, which may affect some of the reported findings. In contrast to the previous approaches, our study is based on actual and unperturbed gene expression data, and none of our key findings rely on any known target prediction algorithm or GO data.
The in silico titration method sPCC
Due to the nature of how miRNAs function, we designed the sPCC method as an in silico titration. We argued that, in order to detect mRNAs whose expression was repressed by miRNAs, the miRNA had to be expressed. We predicted that very high expression of miRNAs would suppress expression of targets representing certain biological processes. In contrast, in cells completely lacking expression of miRNAs, targeted genes would either be expressed or not expressed as a consequence of other regulatory mechanisms. In other words the “pressure” of a miRNA on a gene should be best detectable by comparing cells with high versus low expression of a miRNA. In order to reflect this pressure, different weights were assigned to each of the cell lines based on their miRNA expression levels. One could argue that the classical PCC method might also be modified in this way. However, in practical terms, it would be arbitrary to add weights directly to each cell line. First, the PCC calculation is not a linear process, and hence any change within the PCC function may have uncontrollable consequences. Second, it would not be clear how much weight to assign to each cell line. To solve this problem, we adopted the “titration” method sPCC, with a gradient weighting approach for cell lines sorted according to miRNA expression. We decided to pick 30 cell lines as a threshold, for the following reasons: 1) the model adds weight at the level of cell line selection, so it is linear and additive for PCCs with different cell line selections (e.g. from pattern 30 to pattern 59); 2) the model fully considers the biological importance of the 30 cell lines with the highest expression of each miRNA, as every pattern includes them; and 3) the minimal number of 30 cell lines ensured the methodological stability, as PCC is highly unstable and easy to be interfered by background noise when its sample size is too small (e.g. <10). We intentionally reduced the importance of the rest of the 29 cell lines, rather than simply neglecting them. If we had performed the dPCC analysis by only using the 30 highest expressing cell lines, it would be problematic since different sets of cell lines would be selected for each miRNA. In order to test the stability of our threshold selection, we used a randomized sPCC (rsPCC) method as an internal negative control ().
The performance advantage of the sPCC was obvious when the methods were tested for the ability to detect the conserved miRNA targets with the highest total TargetScan context score in the human genome. Only with the sPCC method the correlation increased with the more strongly predicted targets (). We chose to use the TargetScan predicted genes as opposed to a list of experimentally validated targets because TargetScan provides total context scores, which can be used to rank predicted miRNA-gene pairs (e.g. from top 50 to top 500). While most of these targets are not validated it has been demonstrated that the number of likely targets is greater towards the very top of each ranked list of targets 
. Our sPCC analysis confirmed this. Furthermore, the sPCC method performed better in two other analyses (host gene and HOX
gene cluster gene correlations) when compared to the dPCC method. Interestingly, only the sPCC was good enough to group miRNAs for exploring their potential biological functions. Regardless of whether positively or negatively correlating genes were used, the dPCC method failed to group some of the most biologically relevant miRNAs, such as members of the miR-200
families (data not shown). As an important control the sPCC method performed far better than the randomized rsPCC method. Hence, this method is robust enough to explore biological functions of miRNAs. We decided to let the user decide whether to use the dPCC or sPCC analysis on miRConnect.org
. In summary, our data demonstrate that across all NCI60 cells the sPCC method performed best at both the theoretical and biological level, permitting detection of general connections between miRNAs and their downstream effectors.
Host genes and HOX genes
One mechanism that determines co-expression of miRNAs and mRNAs is when both are driven by the same promoter. This is the case for miRNAs that are part of host genes. A similar situation occurs at the 4 human HOX gene clusters, which provided us with a unique system in which each cluster of multiple genes contained at least one miRNA gene. For both the host gene and the HOX gene cluster case we found a highly significant correlation between the expression of the gene(s) and the miRNAs that are linked to their co-transcription. Of the 73 host gene/miRNA pairs in our analysis of the Q data set, 37 showed a positive sPCC, but negative sPCCs were only found in 5 cases and they were of low significance (). The correlation of miRNAs with their hosting HOX genes was even more impressive. In each case a substantial number of HOX genes positively correlated with the expression of the miRNA in that cluster (9/11 HOXA genes with miR-196b, 8/10 HOXB genes with miR-10a, 5/9 HOXC genes with miR-196a, and 6/9 HOXD genes with miR-10b), and few HOX genes negatively correlated with these miRNAs. This analysis in particular illustrates the strength of the expression correlation that can be found using the NCI60 data analysis. The sPCC method achieved the highest positive correlations between the four miRNAs and their adjacent HOX genes. Out of ~18,000 human genes the following HOX genes had the highest positive sPCC for the miRNAs that are coded within a HOX gene cluster: HOXA9 and miR-196b, HOXB7 and miR-10a, HOXC10 and miR-196a and HOXD8 and miR-10b. Interestingly, our data allow us to predict that miR-196a-1, which is part of the HOXB gene cluster, is not significantly expressed in cancer cells, since neither of the two methods of analysis detected any HOXB genes that were co-expressed with this miRNA ().
Novel and unusual predictions that come out of our analysis
Our analysis makes a number of predictions that cannot be explained based on current knowledge. These include the prediction that let-7 is a family of at least 6 different potential activities in cancer cells (functional clusters II, IV, V, VI, X and XIII in ). Our data now provide the impetus to look into this phenomenon. We hypothesize that nine different let-7 family members do not just exist to allow tissue-specific expression of let-7.
The finding that seems to be most at odds with published data is our results on miR-21
is the miRNA upregulated in the most human cancers when compared to normal tissue 
, and miR-21
was recently shown to induce neoplastic transformation when expressed as a transgene in mice 
. Confirming published data, miR-21
expression negatively correlated with two of its main validated targets, PDCD4 
−4.7) and PTEN 
has, therefore, been viewed as an oncogenic miRNA, and one would expect it to be one of the most growth-promoting miRNAs. It was surprising, therefore, that in our study miR-21
was co-clustered with miRNAs that had a strong negative correlation with genes induced by c-MYC
(functional cluster X in ), which would indicate a growth suppressing activity. We also determined the extent of correlation between the expression of miRNAs and 99 ribosomal genes as markers for cell growth. Interestingly and consistently with the results on the c-MYC
regulated genes, expression of miR-21
had a strong negative correlation with the expression of almost all ribosomal genes. In fact when the miRNAs were ranked according to the correlation with ribosomal genes, miR-21
was the third most clearly negatively correlating miRNA (Figure S6
). How could this finding be reconciled with the existing paradigm of miR-21
being such a significant marker for cancer cells? We recognize that all reports identifying miR-21
as being overexpressed in cancer cells were based on comparisons of cancer with normal cells 
, while our analysis does not include normal cells. Our finding may, therefore, have revealed a cellular function of miR-21
that is not cancer specific. Interestingly, there is very little information on the physiological function of miR-21
outside of cancer. However, two pieces of data provide a clue as to the physiological activity of miR-21
. In the context of a cancer cell line, in an unbiased approach 94 different endogenous miRNAs were individually inhibited in HeLa cells. Inhibition of most miRNAs resulted in reduced growth of the cells. Only two miRNAs were identified as growth repressors, as their inhibition accelerated the growth of the cells. By far the most growth suppressing miRNA was miR-21 
. In normal tissues miR-21
has also been recognized as a paradoxical miRNA 
. In the context of heart hypertrophy, miR-21
was reported to inhibit hypertrophy of cardiomyocytes 
. Based on these data and our analysis we predict that one of the physiological activities of miR-21
is to inhibit, rather than to promote, cell growth.
EMT markers and new EMT regulators
Induction of EMT induces changes in expression of thousands of genes in the human genome. Many of these changes involve sharp on-off regulation of genes. Examples of genes that are regulated in a switch-like manner are E-cadherin
where the difference in the ratio of their expression across the NCI60 cells spans 8 orders of magnitude 
. Neither E-cadherin
are direct targets of miRNAs that regulate EMT, but they are powerful markers for the two cellular stages of epithelial and mesenchymal cells. Interestingly, the remarkable similarity in the genetic programs of EMT is documented by the three EMT gene signatures used in this study, and the resulting similarity in correlation of gene expression with that of certain groups of miRNAs. Our analysis allowed us to identify and validate three new regulators of epithelial cells, miR-7
was previously shown to target the epidermal growth factor receptor (EGFR), a known regulator of EMT 
, and EGFR was also the gene with the highest TargetScan total context score that was most negatively correlated with miR-7
−2.26; see miRConnect.org
was recently linked to EMT regulation and stem cells 
. In our analysis, miR-203
was a potent inducer of E-cadherin
upregulation. Predicted targets include the EMT regulator Snail2 (Slug) which also had a low negative sPCC of −8.01 (miRConnect.org
). Finally miR-375
was recently reported to target 3-phosphoinositide-dependent protein kinase-1 (PDK1) 
which promotes invasion and activation of matrix metalloproteinases 
. We also identified a number of miRNAs that are co-expressed with mesenchymal genes. Among these we found miR-155
, which we had already identified as most highly expressed in mesenchymal cancer cells 
, and this was recently confirmed by others 
miRConnect-Q and miRConnect-L
On the miRConnect.org
site we have made data sets available based on two different miRNA expression data sets. In a recent analysis the level of concordance between 4 different miRNA expression data sets obtained with the NCI60 cells was moderate 
. The PCC between the L and the Q data sets used in our study was found to be 0.56. Nevertheless we found that the discrepancies were often in miRNA species with relatively low expression levels. Data on major miRNA families including let-7
or the miR-17~93
family were more consistent (data not shown). Regardless of the nature of the differences they must not necessarily be due to fluctuation between the analyses. The two platforms that generated the Q and the L data sets are very different and each has both advantages and caveats. The LNA-based arrays that underlie the L data set cannot discriminate between mature miRNAs and their pre-miRs but are more robust in detecting different isomiRs. In contrast, the Q data set relies on real-time PCR involving loop primers. While they are highly selective for the mature miRNAs they cannot distinguish between certain isomiRs. This could be relevant because in a recent study it was demonstrated that miRNAs exit as multiple isomiRs and this varies among species and tissues 
. We therefore decided to make the analysis based on both data sets available on miRConnect.org
Expression levels of miRNAs and mRNAs have been correlated previously in attempts to establish causality 
. This has been complicated by the fact that certain mRNAs are both regulators of miRNA expression/biogenesis and targets of these miRNAs. Examples are c-MYC
, and ZEB1
and miR-200 
. Our work provides an additional way to link miRNA expression to biological function (i.e. miR-200
to EMT or miR-10/miR-196
gene expression) without the need to validate specific mRNAs as miRNA targets.
Based on our data, we propose to focus more on individual pathways when assigning cancer relevant functions to miRNAs and less on global activities. We believe that the functional grouping of miRNAs using miRConnect will significantly aid miRNA researchers and will provide a novel resource to the field that complements the study of miRNA function based on their seed sequences. We propose to group miRNAs families according to seed families, gene clusters, and functional correlations.