|Home | About | Journals | Submit | Contact Us | Français|
MicroRNAs (miRNA) are small regulatory, noncoding RNA molecules that are transcribed as primary miRNAs (pri-miRNA) from eukaryotic genomes. At least in plants, their regulatory activity is mediated through base-pairing with protein-coding messenger RNAs (mRNA) followed by mRNA degradation or translation repression. We describe NOVOMIR, a program for the identification of miRNA genes in plant genomes. It uses a series of filter steps and a statistical model to discriminate a pre-miRNA from other RNAs and does rely neither on prior knowledge of a miRNA target nor on comparative genomics. The sensitivity and specificity of NOVOMIR for detection of premiRNAs from Arabidopsis thaliana is ~0.83 and ~0.99, respectively. Plant pre-miRNAs are more heterogeneous with respect to size and structure than animal pre-miRNAs. Despite these difficulties, NOVOMIR is well suited to perform searches for pre-miRNAs on a genomic scale. NOVOMIR is written in Perl and relies on two additional, free programs for prediction of RNA secondary structure (RNALFOLD, RNASHAPES).
MicroRNAs (miRNAs) are genome-encoded single-stranded RNA molecules of ~22nt in length, which play a significant role in regulation of gene expression in eukaryotes. Many details on biogenesis and interactions of miRNAs are known (see recent reviews, e.g., [1, 2]). Briefly, miRNAs can be encoded by miRNA genes, but also be generated from different RNA transcripts (e.g., from introns of protein-coding genes). Plant and animal miRNAs differ to some extent with respect to biogenesis and structural characteristics but also in their mode of action. In plants, most if not all miRNAs are transcribed from genes by RNA-dependent RNA polymerase II (polII) into primary transcripts called pri-miRNA; these transcripts fold into (possibly imperfect) stem-loop structures. From the pri-miRNA Dicer-like (DCL) enzymes process the stem-loop structure (pre-miRNA), which is usually longer (~130nt; see below) than nonplant pre-miRNA (~86nt), and finally a miRNA/miRNA* duplex. In the cytoplasm, the miRNA is incorporated into the RNA-induced silencing complex (RISC), and base-pairing of the miRNA with complementary messenger RNA (mRNA) regions leads to mRNA degradation or to inhibition of mRNA translation. Most plant miRNAs base-pair with their respective target mRNAs in the coding region with perfect or near-perfect complementarity leading to cleavage (and degradation) of the mRNAs; animal miRNAs usually base-pair with 3′ untranslated regions through imperfect complementarity leading to translation repression.
Finding of miRNA genes either needs costly experimental approaches—for example, genetics, which led to the detection of the first animal miRNAs [3, 4], cloning and sequencing of cDNA, or deep sequencing—or computational prediction methods, which facilitate subsequent experimental verification or falsification. The different properties of miRNAs in plants and animals gave rise to different computational approaches (for reviews see [5–7]). Most of these tools, however, rely on the following features: the miRNA resides in a stem-loop structure, which possess a high thermodynamic stability and does not contain large internal loops or asymmetric bulges at least in the region of the mature miRNA . In addition, many tools take into account a phylogenetic conservation of the pre-miRNA structure and miRNA sequence, which limits the chance to detect non-conserved, evolutionary new miRNA genes. For example, Dezulian et al.  identify plant miRNA homologs in a set of sequences, given a query miRNA, by a sequence similarity search step and a set of structural filters; Pfeffer et al.  identify DNA-viral pre-miRNAs, which show neither detectable conservation to other viral pre-miRNAs nor to host pre-miRNAs, by a search for stable stem-loops and scoring of these according to free energy of folding, base composition, and number of base pairs; Wang et al.  as well as Jones-Rhoades and Bartel  search for putative miRNA/miRNA* complexes in the intergenic regions of Arabidopsis thaliana and filter these according to GC content, mismatches in the stem, conservation in the rice genome, and the characteristic stem-loop structure.
To our knowledge, the only tools for de novo prediction of pre-miRNAs in plants are HHMMiR  and triplet-SVM . HHMMiR calculates first the mfe structure of sequence regions (using RNAfold in a scanning window approach with window length of less than 500nt), extracts stem-loops that possess at least 10 base pairs, a minimum length of 50nt, a loop of less than 20nt and no multiloop(s), and finally classifies via a hierarchical hidden Markov model (HHMM). The sensitivity of HHMMiR is published to be 0.865 for Oryza sativa (96 sequences taken from miRBase 5) and 0.973 for A. thaliana (75 sequences). triplet-SVM calculates by RNAfold the mfe structure of sequences, rejects those with junction(s), too few base pairs, and a high free energy (i.e., low structural stability), parses the remaining structures in “triplets” (type of nucleotide plus paired or unpaired state of the nucleotide and its two neighbors), and finally classifies these features with a support vector machine (SVM). The sensitivity of triplet-SVM is published to be 0.948 for Oryza sativa and 0.92 for A. thaliana using the same sequences from miRBase 5 as in the test with HHMMiR.
In the following, we describe our tool, called novoMIR, to detect pre-miRNA and miRNA/miRNA* sequences in a plant genome. For this purpose novoMIR uses a series of filter steps, similar to those mentioned above, followed by a statistical model to discriminate a pre-miRNA from all other RNAs and by another statistical model to locate the miRNA/miRNA* complex in a putative pre-miRNA. Thresholds and statistical values are learned from sets of true positive sequences (plant pre-miRNAs taken from miRBase; ) and true non-miRNA sequences (tRNAs, 5 S rRNA, 5.8 S rRNA, mRNAs, etc.). For detection, novoMIR relies neither on comparative genomics nor on prior knowledge of a miRNA target; thus novoMIR allows for searches in single plant genomes as well as in viral or viroid genomes.
Sequences of plant pre-miRNAs were obtained from different versions of miRBase [15, 16]: version 10.0 contains 1,247 sequences; the recent version 14 contains 2,030 sequences. The mean and median length of plant sequences are about (150 ± 73)nt and 130nt, respectively (see Figure S1 in Supplementary Material available online at doi:10.4061/2010/495904); the shortest pre-miRNA is 54nt in length (miRBase ID: gma-MIR2107) and the longest is 932nt (cre-MIR916). The mean and median length of nonplant sequences are about (88 ± 14)nt and 86nt, respectively; the shortest pre-miRNA is 44nt in length (hsa-mir-1973) and the longest is 215nt (dme-mir-997). That is, most plant pre-miRNAs are longer than animal pre-miRNAs and their size range is more diverse. The sequences of pre-miRNAs and mature miRNAs are slightly enriched in U  and U plus G, respectively (see Figure S2). The four nucleotides are not equally distributed at each position along the miRNA sequences (see Figure S3): for example, a U is the preferred 5′ nucleotide (f1,U = 0.65), a G on position 8 (f8,G = 0.44), and a C on position 19 (f19,C = 0.52). The minimum free energy ΔG37°C0 of the secondary structures of pre-miRNAs, as calculated by RNAfold  using default parameters, is in a wide range due to the different lengths L and G+C contents fGC of the sequences (see Figure S4); normalization of ΔG37°C0 to length and fGC  results in ΔG37°C0/L = (−0.45 ± 0.12) kcal/mol/nt and ΔG37°C0/L/fGC = (−1.02 ± 0.26) kcal/mol/nt; the latter value is significantly lower than that of other RNA according to Zhang et al. .
We used the 184 pre-miRNAs and mature miRNAs of A. thaliana as listed in miRBase version 10 as the true-positive data set for establishing all thresholds and parameters of novoMIR. Sequences containing nucleotides other than A, C, G, U(T) were discarded. For evaluation of sensitivity we used in addition the plant pre-miRNAs and mature miRNAs from miRBase version 14 (190 from A. thaliana and 1,853 from other plants). The sensitivity of novoMIR was nearly identical for both data sets (and also with sequences from version 14minus those from version 10; see supplemental Table S1); thus we refrained from training with different data sets.
As the true-negative data set, we assembled RNA sets from the following sources:
novoMIR is written in Perl and was tested under Linux. It relies on RNAshapes [24, 25] and RNALfold  (which is part of the Vienna RNA package ) for secondary structure calculations. RNALfold finds subsequences of a long RNA sequence that fold into locally stable (i.e., thermodynamically favorable) RNA secondary structures; the computational effort is (NL2) with length N of the long RNA sequence and maximal base-pair separation L of the subsequences. For an RNA sequence, RNAshapes computes shapes, which are classes of similar secondary structures, and a representative structure “shrep” of minimal free energy within each shape.
In the following, we describe the workflow of novoMIR (see supplemental Figure S5).
Our programnovoMIR uses a set of heuristic filters and a statistical model to discriminate a miRNA precursor from all other RNAs (see Figure S5). The data for this model are collected based on a set of true positive sequences (miRNA precursors from A. thaliana as in miRBase 10) and a set of true non-miRNA sequences (for details, see Section 2.3).
All thresholds for the filter steps and the probabilities for the Hidden-Markov model were selected on the basis of “receiver operating characteristic” (ROC) curves like those shown in Figure 2. For these, the set of true positive A. thaliana pre-miRNA sequences was taken from miRBase version 10. Sensitivity values for the enlarged set of pre-miRNAs from miRBase version 14 (190 A. thaliana and 1,840 sequences from other plants) are compared to those obtained from miRBase version 10 (184 A. thaliana and 1,063 sequences from other plants) in Table 1. The sensitivity values of novoMIR for A. thaliana pre-miRNA sequences of both miRBase versions are very close to each other (0.837 and 0.832, resp.). The values for all plant pre-miRNA sequences are slightly lower (0.791 and 0.792, resp.), but show no clear trend that sequences of miRBase 14 (not present in miRBase 10) are different from those of miRBase 10 or that sequences from a certain taxonomic group might be different from those of others (see supplemental Table S1).
The sensitivity of novoMIR in predicting the position of the miRNA/miRNA* complex is also high (0.73 for A. thaliana and 0.82 for all plants; see Table 1). For this, a position is counted as correctly predicted if it matches exactly the annotated mature miRNA or overlaps by five or fewer nucleotides.
We tested HHMMiR  and triplet-SVM  for sensitivity with the sequences from miRBase 10 and 14 (see Table 1). Their sensitivity is at maximum 0.15 and 0.45, respectively. The filtering steps of both tools reject already many sequences (HHMMiR more than 80% and triplet-SVM more than 22%). For the sequences remaining after the filtering steps, the sensitivity of the HHMM and SVM is at maximum 0.79 and 0.60, respectively, which is also lower than that of novoMIR with a sensitivity of at least 0.80 (using all filter steps).
We assembled different data sets to test the specificity of novoMIR. These data sets should not contain any true (pre-)miRNA. For example, we used well-annotated RNAs (mRNA, noncoding RNA) and sets of “pseudohairpins” from H. sapiens and A. thaliana. Similarly, the chance is negligible that the data set of 10 × 5,000 sequences randomly selected from the A. thaliana genome contains a true miRNA. The most difficult data set consisted of A. thaliana mRNAs; with these novoMIR reached a specificity of 0.975 (see Table 2). With all other data sets specificity was from 0.98 up to 1.00.
We wanted to test the program with a more realistic scenario, given the satisfying sensitivity and specificity values of novoMIR with our test data (see Tables Tables1 and1 and and2).2). We selected all intergenic and intronic regions of the A. thaliana genome from “The Arabidopsis Information Resource” (TAIR), removed all pre-miRNA sequences, and searched within the remaining sequences for potential pre-miRNAs via novoMIR. novoMIR classified 828 sequences from the 30,413 intergenic sequences and 649 sequences from the 148,558 intronic sequences, respectively, as potential pre-miRNAs.
Despite this pleasingly low numbers of hits, however, an interpretation of this outcome is not easy. To get an impression on the hits, we searched with these potential pre-miRNA sequences with BLAST for any annotation and for the miRNA-typical expression pattern in the “Arabidopsis Small RNA Project Database” (ASRP) [28, 29]; such a typical expression pattern of a pre-miRNA includes sequences for the miRNA as well as for the miRNA* (for an example see supplemental Figure S7). To our surprise, we detected that some of the predicted candidates are already described as true pre-miRNAs. An example of such a sequence, predicted by novoMIR as a potential pre-miRNA, is located on A. thaliana chromosome 3 in the region between genes At3G09280 and At3G09290. Its secondary structure and its support by expressed small RNAs are shown in Figure 3 and Figure S8, respectively. It is already known as pre-miR2111a [30, 31], but not present in miRBase 14. The sequences of the mature miR2111a and of miR2111a* predicted by novoMIR also coincide with the sequences given in .
In the following, we mention shortly three further candidate hits, for which we found some support by small-RNA expression in the ASRP but no explicit annotation. One novoMIR hit is located on chromosome 4 between At4G22760 and At4G22770 close to the 3′ terminus of the latter, but on the opposite strand; for further details, see Figure 3 and Figure S9. The next hit (see Figure 3 and Figure S10) is located in between At5G52689 and At5G52690. The last mentioned hit is located in an intron of AT1G01650, which encodes for an aspartic-type endopeptidase/peptidase; the structure of this sequence is shown in Figure 3 and the expression pattern of the genomic region in Figure S11.
Several candidate hits have no support by small RNAs in the ASRP. It is known that many miRNAs are induced by biotic and abiotic stress [36–38]. Thus, a lack of small RNAs might either point to a false-positive prediction or to a stress condition not analyzed for expression of small RNAs. Further candidate hits are located in regions showing expression patterns similar to those of repetitive elements. A recently published review  discussed the possibility that some miRNAs could be evolved from repetitive genomic elements and/or duplication of genomic regions.
Viroids are plant-infectious, noncoding, unencapsidated, circular RNAs that are transcribed in a rolling-circle mechanism either in nuclei (Pospiviroidae) or in chloroplasts (Avsunviroidae) of infected plants. Viroids cause the production of viroid-specific small RNAs (vsRNA) similar in size to small interfering (siRNA) and miRNAs, but they do escape the cytoplasmic silencing mechanism. A positive (or negative) novoMIR prediction of viroids as potential pre-miRNAs would point to the genesis of vsRNAs. For further details, see recent reviews [40–43].
Potato spindle tuber viroid (PSTVd) is the type strain of Pospiviroidae. Because of its high self-complementarity the circular PSTVd RNA folds into a rod-like secondary structure of high thermodynamic stability (see Figure 4). This structure can be divided into five structural domains on the basis of homology between different pospiviroids . Most sequence variants or strains of PSTVd differ by mutations in the pathogenicity-modulating (P) domain and/or variable (V) domain. Only a few nucleotide changes in the P domain are sufficient to exhibit remarkably different symptoms in infected tomato plants Solanum lycopersicon cv Rutgers. If this P domain would be the source of miRNA-like vsRNAs, these could interfere somehow with the host's metabolism leading to symptom production.
For an RNA with PSTVd sequence from positions 263–359/1–96, which is one of the structural elements present during processing of (+)-strand replication intermediates to circles , novoMIR predicted miRNA/miRNA* complexes in the P domain of PSTVd; for an RNA from positions 103–255, which is also a structural elements during processing, novoMIR predicted a further miRNA/miRNA* complex in the TR domain, but only after lowering the normalized energy threshold from the default value tΔG/L/fGC = 0.75 to 0.69. Both regions are marked by italic characters in Figure 4. novoMIR predicted identical positions for complexes in a full-length, linear PSTVd (1–359). Especially the prediction of vsRNAs derived from the P domain supports an involvement of vsRNAs in symptom production via vsRNA-induced (mis)regulation of plant-endogenous RNAs like mRNAs coding for transcription factors. This hypothesis is supported by deep-sequencing of PSTVd-derived vsRNAs in PSTVd-infected tomato plants (Diermann, Matoušek, Teune, Riesner and Steger, submitted) and sequencing of vsRNAs produced in vitro by DCL processing of PSTVd  which showed clusters of vsRNAs derived from the P domain. In contrast, [45, 46] found only vsRNAs in PSTVd-infected tomato plants that clustered in regions outside of the P domain. This discrepancy is unresolved but might be based for example on different purification procedures of the vsRNAs.
Plant pre-miRNAs are more heterogeneous in size and structure than animal pre-miRNAs but still show sufficient characteristic features—such as relative thermodynamic stability of their structure, length of helices, and number and size of loops—to be differentiated from other RNAs. Based on several of these features, we developed a series of filter steps and a statistical model that together are able to detect pre-miRNAs with a sensitivity of about 0.8 and a specificity of about 0.99. Thus, the program, which we call novoMIR, is well suited to search on a genomic scale for new pre-miRNAs that are not necessarily evolutionarily conserved. As an example, we searched with novoMIR for pre-miRNAs in nontranslated regions of the A. thaliana genome and detected among the high-scoring sequences experimentally verified pre-miRNAs, which were not annotated in the recent version of miRBase. Additionally, novoMIR recognizes viroids as pre-miRNAs, which supports the hypothesis that viroid-specific small RNAs are generated in a miRNA-like pathway.
Fig. S1: Length distribution of plant pre-miRNA in different versions of mirBase.
Fig. S2: Nucleotide composition of plant pre-miRNAs and mature miRNAs in different versions of miRBase.
Fig. S3: Nucleotide distribution in mature plant miRNAs from mirBase version 14.
Fig. S4: Minimum of free energy (in kcal/mol) of secondary structures of plant pre-miRNA in mirBase version 14.
Fig. S5: Workflow in novoMIR.
Fig. S6: Hidden-Markov model used for classification.
Fig. S7: Expression pattern of ath-MIR842a viewed in the ASRP browser.
Fig. S8: Expression pattern of a pre-miRNA candidate viewed in the ASRP browser.
Fig. S9: Expression pattern of a pre-miRNA candidate viewed in the ASRP browser.
Fig. S10: Expression pattern of a pre-miRNA candidate viewed in the ASRP browser.
Fig. S11: Expression pattern of a pre-miRNA candidate viewed in the ASRP browser.
Table S1: Comparison of novoMIR's sensitivity for detecting plant pre-miRNAs from different versions of mirBase.
The project was supported by a grant from the German Science Foundation to Detlev Riesner and G. Steger.
The authors declare that they have no competing interests.
Acknowledgment is given to Dr. M. Schmitz and Dr. L. Nagel for critical reading of the paper. The package and supplementary information can be downloaded at http://www.biophys.uni-duesseldorf.de/novomir/.