Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2744435

Formats

Article sections

Authors

Related links

J Theor Biol. Author manuscript; available in PMC 2010 August 21.

Published in final edited form as:

Published online 2009 May 20. doi: 10.1016/j.jtbi.2009.05.007

PMCID: PMC2744435

NIHMSID: NIHMS119512

The publisher's final edited version of this article is available at J Theor Biol

See other articles in PMC that cite the published article.

MicroRNAs (miRNAs) are short noncoding RNAs involved in post-transcriptional gene regulation via binding to mRNAs. Studies show that in a multicellular organism microRNAs (miRNAs) downregulate a large number of target mRNAs. However, predicting the target genes of a miRNA is challenging. Microarray expression profiling has been proposed as a complementary method to increase the confidence of miRNA target prediction, but it can become computationally costly or even intractable when many miRNAs and their effects across multiple tissues are to be considered. Here, we propose a statistical method, the relative *R*^{2} method, to find high-confidence targets among the set of potential targets predicted by a computational method such as TargetScanS or by microarray analysis, when expression data of both miRNAs and mRNAs are available for multiple tissues. Applying this method to existing data, we obtain many high-confidence targets in mouse.

MicroRNAs (miRNAs), which are single-stranded RNAs of ~20–23 nucleotides, posttranscriptionally regulate gene expression. Computational and molecular cloning approaches have revealed hundreds of miRNAs in a variety of organisms (Ambros et al. 2003; Houbaviy et al. 2003; Lim et al. 2003; Kim et al. 2004). Many computational methods have been developed to predict miRNA targets; for example, TargetScanS predicts the targets of a miRNA by searching for the presence of conserved 8mer or 7mer sites that match the seed region of the miRNA. A small portion of these predicted targets have been experimentally validated, showing a relatively high accuracy for target prediction (Sethupathy et al. 2006).

Computational approaches currently use sequence complementarity and most of them also use evolutionary conservation to identify potential targets. It has been suggested that the false-discovery rate for computationally predicted targets is ~50% (Lewis et al. 2005; Farh et al. 2005). Besides the sequence complementarity approach, Grimson et al. (2007) pointed out that a crux for target recognition is ~7 nt sites that match the seed region of the miRNA. Since these seed matches are not always sufficient for repression, they uncovered five general features of site context that boost binding efficacy: AU-rich nucleotide composition near the site, proximity to sites for coexpressed miRNAs, proximity to residues pairing to miRNA nucleotides 13–16, positioning within the 3’UTR at least 15 nt from the stop codon, and positioning away from the center of long UTRs.

Profiling miRNA expression is very helpful for studying the biological functions of miRNAs, so it has been used to as a complementary method for discovering miRNA targets (Lim et al. 2005). However, this method can become computationally complicated when multiple miRNAs and their effects across multiple tissues are to be considered. To overcome this difficulty, we use statistical methods to build up a network of associations between the miRNAs and their target mRNAs.

In this study, the relative *R*^{2} method is proposed to select high-confidence targets from predicted targets. The relative *R*^{2} method can be explained statistically from the degree of fitness of a model in terms of a subset of independent variables.

A method for finding miRNA targets using Bayesian variation analysis was recently proposed by Huang et al. (2007). This method is complicated and requires extensive calculations. We apply our method to the same dataset used in Huang et al. (2007) and select 448 high-confidence targets such that the relative *R*^{2} for each target reaches 0.995, which is considerably higher than those targets predicted by Huang et al. (2007), whose average relative *R*^{2} is less than 0.9.

We introduce our method with the usual linear regression model, before considering the general model.

Consider *n* miRNAs, *z*_{1} , ……, *z _{n}*, and

First, for an mRNA, we can find the miRNAs, say *z*_{1} , ……, *z _{k}* , from the set of potential targets such that each of the miRNAs has this mRNA as its potential target. Our goal is to find from these k miRNAs the miRNAs that have a significant effect on the expression level of this mRNA. Assume that the expression levels for the mRNA in the

$${y}_{j}={b}_{0}{z}_{0j}+{b}_{1}{z}_{1j}+{b}_{2}{z}_{2j}.\dots +{b}_{k}{z}_{\mathit{\text{kj}}}+{\epsilon}_{j,}j=1,\dots ,l,$$

(1)

where ε_{j} is the error term.

In model (1), the best estimator of β = (*b*_{0},*b*_{1},*b*_{2},…,*b _{k}*) ' is = (

Let *f _{i}* = (

Our method of selecting miRNAs that significantly affect the level of the mRNA is first to rank the *k* miRNAs according to their p-values --- the smaller the p-value, the higher the rank. The p-value of miRNA *z _{i}* is defined as the probability

$$P(|W|\ge \frac{|{\widehat{b}}_{i}|}{\sqrt{\mathit{\text{Var}}({\widehat{b}}_{i})}}),$$

which is the p-value used to test : *H*_{0} :*b _{i}* =0, where

Rank the miRNA as the *j* th significant miRNA if its p-value is the *j* th smallest p-value. Calculate the *R*^{2} for model (1), say *g _{k}* . Consider the miRNAs that have a p-value less than a critical value, say

Assume that there are *m* (*m* ≤ *k*) miRNAs, *z*_{1},……, *z _{m}* , whose p-values are less than

$${y}_{i}={c}_{0}{z}_{0i}+{c}_{1}{z}_{1i}+{c}_{2}{z}_{2i}.\dots +{c}_{m}{z}_{\mathit{\text{mi}}}+{\epsilon}_{i}i=1,\dots ,l,$$

(2)

Let *c* =(*c*_{0},*c*_{1},…*c _{m}*),

$$\widehat{c}=({\widehat{c}}_{0},{\widehat{c}}_{1},\dots .,{\widehat{c}}_{m})={({Z}_{r}^{T}{Z}_{r})}^{-1}{Z}_{r}^{T}Y.$$

Denote the *R*^{2} for the regression model (2) as *g _{m}* . If

We define the value *g _{m}* /

However, even if *g _{k}* is not high, it is still possible that the mRNA is the true target for some miRNAs among these

Using the method, for each mRNA, we can assign a set of miRNAs such that these miRNAs significantly affect the expression of the mRNA in terms of the linear model. For a miRNA, we can collect the set of mRNAs such that each mRNA in the set is a significant potential target of this miRNA by the relative *R*^{2} method. Then the mRNAs collected are the high-confidence targets of this miRNA.

Note that in order to eliminate the difference between the different tissues used in the analyses, we can first transform the expression data of mRNAs by normalizing the data in each tissue such that the scale of the expression data used in each tissue is the same. We use the normalized expression data when we apply the above approach to select high-confidence targets.

The linear model used in Section 2.1 can be replaced by another kind of model, such as a nonlinear regression model. For a general model to fit the *y _{i}* by using {

Furthermore, it is feasible to apply the relative *R*^{2} method to other criteria such as the adjusted *R*^{2} , etc. Since the value of adjusted *R*^{2} may be negative, a situation that requires more consideration, the application of the relative method under other criteria is currently under investigation.

We now apply the new method to the data of Babak et al. (2004), which was used by Huang J.C. et al. (2007). The data set includes 1770 potential targets for 22 miRNAs across 17 tissues, which were predicted by Target-Scan in a dataset of 41699 mouse mRNAs in Babak et al. (2004) and Zhang et al. (2004). The 1770 potential targets are from 788 different mRNAs because some miRNAs have the same mRNA as their targets. The microarray expressions of the 41699 mRNAs across the 17 tissues can be represented by a 41699×17 matrix, and the microarray expressions of the 22 miRNAs across the 17 tissues can be represented by a 22×17 matrix. (The 22 miRNAs studied are let-7a, miR-1, miR-101, miR-107, miR-122a, miR-124a, miR-125b, miR-126, miR-133a, miR-16, miR-181a, miR-183, miR-194, miR-205, miR-22, miR-23b, miR-24, miR-26a, miR-29b, miR-34a, miR-92, and miR-93; and the 17 tissues studied are brain, femur, lung, heart, skeletal muscle, mammary gland, teeth, bladder, stomach, ES, spleen, embryo 12.5, embryo, placenta 9.5, embryo 9.5, small intestine, liver.)

To apply the relative *R*^{2} method to an mRNA in the 788 mRNAs, we first normalize the expression data of the mRNAs using the 41699 expression data for each tissue. The normalization method is first to calculate the mean and standard deviation of the 41699 expression values for each tissue. Then, for each tissue, the normalized expression data is the original expression data minus the mean and then divided by the standard deviation. Since the data set included 41699 expression data points, we can use it as a reference to make the normalization such that the scale used in the expression data of mRNA for each tissue is the same.

We find the miRNAs such that the mRNA is one of the potential targets of these miRNAs from the 1770-target dataset, and then use a regression model to fit the normalized expression data of the mRNA. Huang et al. (2007) used the Bayesian variation method to derive high-confidence targets. This method is more complicated and computationally more costly than the relative *R*^{2} method.

Using the present method, we can select high-confidence targets such that the relative *R*^{2} for each target reaches 0.995, given *p*_{0} =0.47. A total of 448 high-confidence targets are found and the average relative *R*^{2} for these 448 targets is 0.999. Here the *p*_{0} value is selected such that the number of about one-fourth targets in the 1770 targets can be selected by the method with the relative *R*^{2} reaching 0.995.

The above dataset can also be used to conduct a random permutation test of our method. We test whether our method can select more high-confidence targets from the set of the 1770 potential targets predicted by TargetScanS than from a set that is constructed by randomly assigning each one of the 1770 targets to one of the 22 miRNAs. The random permutation was repeated ten times, and the average number of selected high-confidence targets over the ten times was 336 (s.d. ≈ 27), which is significantly lower than 448, the number of high-confidence targets found by our method (see above) with a p-value of 1.4×10^{−10} . The p-value is derived from viewing the two proportions of the selected targets by the random permutation method and the relative *R*^{2}method as the proportions of two binomial distributions and from testing the equality of the two proportions by the normal approximation. In addition, we also compare the targets shared between the random permutation case and the selected high-confidence targets. The average number of the shared targets from several comparisons is 25. So most of the selected high-confidence targets are not the same as the targets selected by random permutation. This result upholds the relative *R*^{2} method because (i) the method can select more targets than random permutation and most of the high-confidence targets are not the same as the targets selected by random permutation, and (ii) the method gives accordant results between the expression data analyses and TargetScanS analyses.

To make a more extensive comparison with the random permutation case, we conduct simulations for different cases by varying the values of *p*_{0} and *s*. Figure 1 and Figure 2 show that the numbers of high-confidence targets selected from the 1770 potential targets are always greater than the numbers selected from the random permutation case. Figure 1 shows that the difference between the two numbers increases with *p*_{0} , which reinforces the argument in Section 2 that the condition about the constraint of *p*_{0} should not be too strict; otherwise, the advantage of the relative *R*^{2} method is limited by this constraint.

Relationship between the number of selected high-confidence targets and the *p*_{0} threshold used. The solid and dotted lines denote the numbers of high-confidence targets selected by the relative *R*^{2} method from the 1770 potential targets and from the dataset **...**

Relationship between the number of selected high-confidence targets and the relative *R*^{2} value threshold used. The solid and dotted lines denote the numbers of high-confidence targets selected by the relative *R*^{2} method for the 1770 potential targets and **...**

In addition, in Table 1 we present several sets of *p*_{0} and _{s} values for each of which the number of the selected targets by the relative *R*^{2} method with respect to these values is close to 450. It is seen that *p*_{0} is an increasing function of *s* when the proportion of the selected targets is set to be a fixed value. To obtain a fixed proportion of targets, there are more than one set of *p*_{0} and *s* values that can be chosen and the targets selected may be different with respect to different *p*_{0} and *s* . One may be interested in which set is more appropriate here. As we mentioned above, it is not recommended to choose a small *p*_{0} because it may confine the overall performance of the relative *R*_{2} method. Besides, from the correlation analysis in Section 4, the confirmed targets and miRNA may not have a high correlation when we investigate their relations individually, but can reveal the relation when we consider their overall performance by the relative *R*^{2} method. It corroborates the view that the selection of *p*_{0} should be more relaxed, while the selection of *s* can be more strict. Therefore, we select *p*_{0} as 0.47 and *s* as 0.995.

The 6 sets of p and s values for which the number of selected targets by the relative *R*^{2} method is close to 450.

The ratio of the number of high-confidence targets found by our method to the total number of the potential targets (1770) for each of the 22 miRNAs is shown in Figure 3.

The ratios of high-confidence targets for the 22 miRNAs selected by the relative *R*^{2} method to the targets for the 22 miRNAs selected by TargetScanS.

We can also check our analysis with the literature. We recover several confirmed targets from the literature (Bartel 2004, Cimmino et al. 2005, Farh et al. 2005, Lim et al. 2005, and Lewis et al. 2005). These include the relationships between miR-92 and the mitogen-activated protein kinase kinase 4 (MAP2K4) gene, between miR-16 and the B-cell CLL/lymphoma 2 (BCL2) gene, between miR-124a and the solute carrier family 15 member 4 protein (SLC15A4) gene, and between miR-124a and the homeodomain interacting protein kinase 1 (HIPK1) gene. We also recover the relationship between miR-181a and the B-cell CLL/lymphoma 2 (BCL2) from Tarbase (Sethupathy et al. 2006).

The relative *R*^{2} method is proposed to analyze the data from the relative instead of from the absolute statistical point of view. If the correlation between the mRNA and miRNA is high, then we can directly adopt a standard statistical method to explore the high confidence targets. However, when the correlation between the mRNA and miRNA is not high, it is challenging to develop a statistical method to select correct targets. In such a case, if we use a standard statistical criterion, such as a high *R*^{2} to select the high-confidence targets mRNAs, then no confirmed target mRNAs may be selected. In this case, it would be better to use a variable standard that is dependent on the mRNA and miRNAs under study, rather than using a single standard for all genes. The relative *R*^{2} method can provide a variable standard to solve this problem.

We now discuss the relation between the confirmed targets and the miRNA by exploring their correlation coefficients and relative *R*^{2} values. For analyzing the data and investigating the relationship of the miRNAs and their targets, before modeling the data, a good way is to investigate the relation of the miRNA with each individual potential target. Although this study is to find the effect of multiple miRNAs on a target, rather than the effect of a single miRNA, understanding the connection between the target mRNA and a miRNA is helpful for reinforcing the validity of the proposed method.

For example, let us consider the three mRNAs, HIPK1, MAP2K4 and BCL2, mentioned in Section 3 and explore the relationship between their correlation coefficient and the *R*^{2} value.

First, consider the HIPK1 mRNA, which is a potential target for the four miRNAs, miR-124a, miR-181a, miR-26a and miR-92. Although we are interested in knowing how the four miRNAs affect the expression of the mRNA, we also can investigate the relationship between the expression of each miRNA and the expression of HIPK1. The correlation coefficients of the expression for the four miRNAs with the expression of HIPK1, across the 17 tissues, are shown in Figure 4.

(a) The correlation coefficients between the homeodomain interacting protein kinase 1 ( HIPK1) mRNA expression level and the expression level of each of miR-124a, miR-181a, miR-26a and miR-92; (b) The correlation coefficients between the MAP2K4 mRNA expression **...**

By using the relative *R*^{2} method, three of the four miRNAs are selected such that its relative *R*^{2} reaches 0.995. The three selected miRNAs are miR-124a, miR-181a and miR-92 and their correlation coefficients with the HIPK1 mRNA are −0.566, 0.151 and −0.116, respectively (Figure 4(a)). Note that miR-26a, which is not selected, has the largest correlation coefficient 0.333. The correlations of two of the three selected miRNAs are negative, in agreement with the expectation that a miRNA usually downregulates its target mRNAs (Farh et al. 2005; Lim et al. 2005). The standard *R*^{2} values for the linear model of fitting the expression data of HIPK1 using the expression level of the four miRNAs and the expression level of the three selected miRNAs are 0.622 and 0.621, respectively. In this case, we can see from Figure 4 that the correlation coefficients between HIPK1 and the four miRNAs are not high. Therefore, it is hard to construct a model to fit the expression data of HIPK1 in terms of the expression data of the four miRNAs. So, if we use the standard *R*^{2} value, we may not be able to select any one of the three miRNAs, though one of the relations is confirmed in the literature. On the other hand, if we use the relative *R*^{2} method by comparing the ratio of 0.621/0.622 to 1, the confirmed relationship can be selected.

For the confirmed MAP2K4 mRNA and its corresponding miRNA miR-92, their correlation coefficient is −0.030 and the *R*^{2} is 0.198 (Fig. 4(b)). For the confirmed BCL2 mRNA and its corresponding miRNA miR-16 (Fig. 4(c)), their correlation coefficient is −0.027 and the *R*^{2} is 0.104. Therefore, if we use the correlation coefficient or *R*^{2} to select high confidence targets, these two confirmed mRNAs will not be selected. Because the two correlation coefficients are −0.030 and −0.027, we do not expect the two confirmed mRNAs to be selected by any statistical method from the absolute point of view. Instead, we need to use the relative criterion to select the targets because the coefficients of other miRNAs in the potential targets dataset are also not high. By including the other miRNAs in the potential targets dataset, we can construct the relative criterion to select the miRNAs such that the effects of the miRNAs on the expression level of the mRNA are found to be significant.

In addition, we present the overlap of the selected targets between Huang et al. 2007 and the relative *R*^{2} method in Figure 5. The total number of overlaps for the 22 miRNAs is 142. The overlap numbers for the three miRNAs, miR-126, miR-183 and miR-122a, are zero. The total overlap number is not large, perhaps because the correlation between the miRNA and their targets mRNA is not high, which was validated from several confirmed relationship as we mentioned above. This may lead to the variation between different statistical approaches.

The number of overlaps between the predictions by Huang et al. (2007) and the relative *R*^{2} method.

Besides, to estimate the accuracy of the relative *R*^{2} method, we compare the number of relationship appeared in Tarbase, but not found in the relative *R*^{2} method from the 1770 potential relationships. We found only two Tarbase interactions in the 1770 potential targets: the relationship between miR-181a and BCL2 mRNA and the relationship between miR-181a and HOXA11 mRNA. Only the relationship between miR-181a and BCL2 mRNA appeared in the 448 selected targets by relative *R*^{2}method and in those selected by Huang et al. (2007). However, the other relationship between miR-181a and HOXA11 mRNA can be selected by the relative *R*^{2}method if we relax the criteria by choosing 0.67 *p*_{0} and *s* = 0.9999 , which leads to 715 selected targets. Note that at first we set up *p*_{0} as 0.47 and *s* as 0.995, because we intended to obtain 25% of the targets (~ 450) among the 1770 potential targets as the high-confidence targets. But to coincide with Tarbase interactions, we can use the above new thresholds to select targets. The ratio of the number of the selected targets to the number of the potential targets is 715/1770 ≈ 0.4 . Thus, if we relax the thresholds to include 40% of the potential targets selected, then the two relationship found in Tarbase can be selected.

In summary, from the above discussions, combining results from the confirmed targets and theoretical statistical inference to develop methods for exploring the relationship between miRNA and mRNA can be more useful.

We thank Han Liang, Tsunglin Liu and Henry Lu for valuable suggestions. This study was supported by Academia Sinica, Taiwan, and by NIH grants GM30998 and GM081724.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

1. Ambros V, et al. A uniform system for microRNA annotation. RNA. 2003;9:277–279. [PubMed]

2. Babak T, Zhang W, Morris Q, Blencowe BJ, Hughes TR. Probing microRNAs with microarrays: tissue specificity and functional inference. RNA. 2004;10:1813–1819. [PubMed]

3. Bartel DP. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell. 2004;116:281–297. [PubMed]

4. Cimmino A, et al. miR-15 and miR-16 induce apoptosis by targeting BCL2. Proc Natl Acad Sci U S A. 2005;102:13944–13949. [PubMed]

5. Farh KKH, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. Science. 2005;310:1817–1821. [PubMed]

6. Grimson A, Farh KKH, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP. MicroRNA Targeting Specificity in Mammals: Determinants beyond Seed Pairing. Molecular Cell. 2007;27:91–105. [PMC free article] [PubMed]

7. Houbaviy HB, Murray MF, Sharp PA. Embryonic stem cell-specific microRNAs. Dev Cell. 2003;5:351–358. [PubMed]

8. Huang JC, Morris QD, Frey BJ. Bayesian Inference of MicroRNA Targets from Sequence and Expression Data. Journal of Computational Biology. 2007;14:550–563. [PubMed]

9. Kim J, Krichevsky A, Grad Y, Hayes GD, Kosik KS, Church GM, Ruvkun G. Identification of many microRNAs that copurify with polyribosomes in mammalian neurons. Proc Natl Acad Sci U S A. 2004;101:360–365. [PubMed]

10. Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T. Identification of tissue-specific microRNAs from mouse. Curr Biol. 2002;12:735–739. [PubMed]

11. Lau NC, Lim LP, Weinstein EG, Bartel DP. An abundant class of tiny RNAs with probable regulatory roles in *Caenorhabditis elegans*. Science. 2001;294:858–862. [PubMed]

12. Lee RC, Ambros V. An extensive class of small RNAs in *Caenorhabditis elegans*. Science. 2001;294:862–864. [PubMed]

13. Lee RC, Feinbaum RL, Ambros V. The *C. elegans* heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75:843–854. [PubMed]

14. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120:15–20. [PubMed]

15. Lim LP, Glasner ME, Yekta S, Burge CB, Bartel DP. Vertebrate microRNA genes. Science. 2003;299:1540. [PubMed]

16. Lim LP, et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005;433:769–773. [PubMed]

17. Saunders MA, Liang H, Li WH. Human polymorphism at microRNAs and microRNA target sites. Proc Natl Acad Sci U S A. 2007;104:3300–3305. [PubMed]

18. Sethupathy P, Corda B, Hatzigeorgiou AG. TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA. 2006;12:192–197. [PubMed]

19. Supplemental Data for Lewis et al. Cell. pp. 15–20. http://web.wi.mit.edu/bartel/pub/Supplemental%20Material/Lewis%20et%20al%202005%20Supp/ [PubMed]

20. Zhang W, et al. The functional landscape of mouse gene expression. J Biol. 2004;3:21–43. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |