Prediction of Arabidopsis miRNAs
To predict new miRNAs by computational methods, we defined sequence and structure properties that differentiate known Arabidopsis miRNA sequences from random genomic sequence, and used these properties as constraints to screen intergenic regions in the A. thaliana genome sequences for candidate miRNAs.
Besides the well known hairpin secondary structure of miRNA precursors, the 19 unique known
Arabidopsis miRNAs collected in Rfam [
37] were evaluated for the following computable sequence properties: G+C content in mature miRNA sequences, hairpin-loop length in their precursor RNA structures, number and distribution of mismatches in the hairpin stem region containing the mature miRNA sequence, and phylogenetic conservation of mature miRNA sequences in the
O. sativa genome. Sequences of all 19 known
Arabidopsis miRNAs had a G+C content ranging from 38% to 70%. For 15 of the 19 miRNAs, the predicted secondary structure of their precursors, or at least one precursor if a miRNA has multiple genomic loci, had a hairpin-loop length ranging from 20 to 75 nucleotides. In the hairpin structures formed by miRNA precursors, all miRNAs were found in the stem region of the hairpin, and had at least 75% sequence complementarity to their counterparts. Fifteen of 19 miRNAs were conserved with at least 90% sequence identity in the
O. sativa genome. Thus, constraints of G+C content between 38 and 70%, a loop length between 20 and 75 nucleotides, and a minimum of 90% sequence identity in
O. sativa were used to predict
Arabidopsis miRNA.
The first step was to search for potential hairpin structures in the Arabidopsis intergenic sequences. As most known Arabidopsis miRNAs are around 21 nucleotides long, we used a 21-nucleotide query window to search each intergenic region for potential miRNA precursors as follows: for each successive 21-nucleotide query subsequence, if a 21-nucleotide pairing subsequence with more than 75% sequence complementarity was found downstream within a given distance (hairpin-loop length), the entire sequence from the beginning of the query subsequence to the end of the complement pairing subsequence with a 20-nucleotide extension at each side was extracted and marked as a possible hairpin sequence (see Materials and methods for details). The minimum and maximum hairpin-loop lengths used in this prediction were 20 and 75 nucleotides. Each 21-nucleotide query subsequence and its downstream complementary subsequence were considered as 'potential 21-mer miRNA candidates' (referred to as '21-mers'). If a series of overlapping forward query sequences and their corresponding downstream pairing sequences were all identified from the same hairpin structure, each of them was initially considered as an individual 21-mer.
The second step was to parse miRNA candidates according to their nucleotide composition and sequence conservation. A filter of G+C content between 38 and 70% was applied to all 21-mers obtained from the above step, followed by a requirement for more than 90% sequence identity in the
O. sativa genome. The secondary structures of the resulting candidates were evaluated by mfold [
38]. Only 21-mers whose
Arabidopsis precursor and corresponding rice ortholog precursor both had putative stem-loop structures as their lowest free energy form reported by mfold were retained. Because some non-coding RNA genes were not included in the current
Arabidopsis gene annotation, orthologs of known non-coding RNA genes other than miRNAs were subsequently removed by aligning the 21-mers to non-coding RNAs collected in Rfam with BLASTN (version 2.2.6) [
37]. The 21-mers that passed all sequence and structure filters above were considered as final miRNA candidates. A summary of the prediction algorithm is shown in Figure .
In cases where two or more overlapping 21-mer miRNA candidates from the same precursor were collected in the final miRNA candidate set, each miRNA candidate was scored using the following formula:
miRNAscore = number of mismatches + (2 × number of nucleotides in terminal mismatches) + (number of nucleotides in internal bulges/number of internal bulges) + 1 if the miRNA sequence does not start with U.
The term 'terminal mismatches' in the formula above refers to consecutive mismatches among the beginning and/or ending nucleotides of a mature miRNA sequence. The term 'bulge' refers to a series of mismatched nucleotides. Because the sequences of most known miRNAs start with a U, a U-start preference was used in the formula above by penalizing non-U-start sequences. The sequence with the lowest miRNAscore from a series of overlapping 21-mers was selected as the final miRNA candidate.
In total, we predicted 95 miRNA candidates in the Arabidopsis genome, including 12 known Arabidopsis miRNAs and 83 new candidates. The former group corresponds to 63% of known Arabidopsis miRNAs to date (12 of 19). The remaining seven known miRNAs not included in the current prediction were filtered out as a result of their lower sequence conservation in the rice genome or longer loop length in their secondary structure, as outlined in Figure . Because of the complementarity between the two DNA strands of a given genome region, theoretically there should be two sequence possibilities for a predicted miRNA: the predicted sequence itself or, alternatively, its reverse complementary sequence located on the opposite strand of the genome. In many cases, however, owing to G::U pairing in RNA structure prediction, the complementary sequence of a miRNA precursor did not always exhibit a hairpin structure as its lowest energy folding form because the complement of a G::U pair, that is, C::A, altered the secondary structure. Thus, we were able to identify the coding strand of most predicted miRNA candidates through secondary structure evaluation. Furthermore, as described in the following sections, the sequences/partial sequences of some miRNA candidates or their precursors could be found in the Arabidopsis MPSS data used. As most MPSS data probably represent the expression of their associated miRNAs, we were able to use them to predict the miRNA coding strand. The coding strand of miRNA candidates that were contained in the ASRP database was determined according to cloned RNA sequences (see below for details). The complete list of predicted miRNAs is shown in Additional data file 1.
Experimental validation of predicted miRNAs
To gain support for the expression of the predicted miRNAs, northern blot hybridizations were carried out using RNA samples from different tissues selected to cover a spectrum of potential miRNA expression patterns. Using strand-specific oligonucleotide probes, positive signals of expression were detected for 14 out of 18 miRNA candidates tested. The results for all newly identified miRNAs are shown in Figure and . Oligonucleotide probes against the antisense strand of different miRNA candidates were used as negative controls, and none produced any signal, as shown for miR417 in Figure . Note that an extended exposure time was needed to detect expression of most miRNAs (indicated by a number in days in parentheses in Figure ), suggesting that their abundance is significantly lower than that of other known miRNAs (that is, miR158 and miR159a in Figure , and data not shown). In this analysis we also included 10 21-mers that were rejected by our miRNA prediction criteria as negative controls to evaluate the specificity of northern blot hybridization; as expected none of them produced a positive signal. The secondary structures of a few selected northern blot hybridization-positive miRNA candidates are shown in Figure . A full list of the secondary structures of predicted precursors of Arabidopsis miRNA candidates and their rice orthologs is available in Additional data file 2.
Among the 14 miRNAs that produced positive signals in the northern blot hybridizations, two are close paralogs of known miRNAs; miR169b is a paralog of miR169 and miR171b is a paralog of miR170. Because it is impossible to distinguish closely related sequences by northern blot hybridization, we were unable to rule out the possibility that signals detected by probes for miR169b and miR171b were contributed by their known miRNA paralogs. However, as miR169b was also identified in the ASRP database (see next section), we were able to conclude that miR169b was a real miRNA. Thus, 12 candidates validated by northern blot hybridization should be annotated as bona fide miRNAs (see Table for a summary).
| Table 1miRNAs verified by northern blot hybridizations and their supporting evidence |
Cloning evidence for predicted miRNAs
An ASRP database has recently been made publicly available [
36]. Sequences in the ASRP database were collected by cloning small RNA molecules with similar size to miRNAs and siRNAs [
39]. To check whether any of our predicted miRNAs can be identified by a standard RNA cloning method, we compared the 83 predicted miRNA candidates with all sequences in the ASRP database. Eight newly predicted miRNA candidates were found in the ASRP database (Figure ). Among them, five were identical to one or more cloned RNA molecules, indicating that we had correctly predicted the 5' and 3' ends and the actual length of these miRNA candidates. For the other three candidates, our predicted sequences were either shorter than, or a few nucleotides shifted from, their corresponding clones in the ASRP database. The exact sequences of these three miRNA candidates were then corrected according to the corresponding sequences in the ASRP database. The expression of miR169b and miR172b* was also detected by northern blot hybridization (Figure ). Although miR169h was present in the ASRP database, it could not be detected by northern blot hybridization (see Additional data file 1). According to the current miRNA annotation criteria [
22], these eight predicted miRNA candidates with corresponding cloned sequences in the ASRP database should be annotated as
bona fide miRNAs.
MPSS evidence for known and predicted Arabidopsis miRNAs
To further validate the predicted miRNA molecules, we took advantage of available
Arabidopsis massively parallel signature sequencing (MPSS) data. The MPSS sequencing technology identifies unique 17-nucleotide sequences present in cDNA molecules originated from polyadenylated RNA extracted from a cell sample. By inserting cDNA molecules into a cloning vector containing distinct 32-mer oligonucleotide tags, the MPSS technology ensures that each cDNA molecule is ligated to a unique tag and that more than 99% of the total cDNAs are represented after the cloning step. Tagged cDNAs are then amplified by PCR and hybridized to microbeads that have been precoated with multiple copies of unique anti-tags complementary to one type of 32-nucleotide tag. The expression level of a particular transcript is measured by counting the number of distinct microbeads that contain the same 17-nucleotide cDNA sequence. The MPSS technology does not require prior knowledge of a gene's sequence and thus can identify novel or rarely expressed genes. For a complete description, see [
40,
41].
To assess the degree to which MPSS data could be used to support predicted miRNAs, we inspected the 19 known
Arabidopsis miRNAs for unique representation in public
Arabidopsis MPSS datasets and in our own MPSS datasets derived from a variety of tissues and conditions (see Materials and methods for details) [
42-
44]. We compared the intergenic genomic sequence flanking the 19 known
Arabidopsis miRNAs with the MPSS data. We found 30 MPSS signature sequences that were identical to subsequences within the flanking 500-bp sequences either upstream or downstream of 14 known miRNAs (see Additional data file 3). All 30 MPSS sequences were reported in both the public and private MPSS datasets. They occurred upstream, downstream or partially overlapping with known mature miRNAs. Despite the highly repetitive nature of the
Arabidopsis genome, 28 of the 30 MPSS signatures mapped uniquely to only one miRNA locus, with no matches elsewhere in the genome. Two genomic loci were found for each of the two exceptional MPSS signatures MPSS78528 and MPSS28409. For MPSS78528, the associated miRNA mir162 appeared twice in the
Arabidopsis genome (upstream of At5g08180 and upstream of At5g23060) and the MPSS sequence mapped exactly to those regions. For MPSS28409, its second genomic match was on the opposite strand of an intron in gene At3g04740, which was very unlikely to be a source for MPSS sequences because samples for MPSS were prepared from mRNA or other type of polyadenylated RNA molecules, in which introns should have been processed. Thus, the MPSS data accurately reflected the expression of 14 known
Arabidopsis miRNAs from a total of 19, indicating that it can be used as a source of indirect experimental support for the expression of predicted miRNAs.
We then assessed the presence of MPSS signature sequences for the 83 predicted miRNAs. Using the approach described above, 23 MPSS signature sequences corresponding to the flanking sequences of 16 predicted miRNAs were found (see Additional data file 1). All 23 MPSS signature sequences were present in both the public and our own MPSS datasets, and mapped uniquely to the miRNA flanking sequence. The expression of nine miRNA candidates supported by MPSS data was also tested by northern blot hybridization, with eight of them producing a positive signal. Another three miRNAs with MPSS data were found in the ASRP database (see previous section and Additional data file 1). These results indicate that MPSS data indeed represent the expression of predicted miRNAs.
Comparison of predicted miRNAs to known Arabidopsis miRNAs
To explore the relationship of predicted miRNAs to known Arabidopsis miRNAs, we compared the sequences of all 83 miRNA candidates from our prediction with sequences of the 19 known Arabidopsis miRNAs. Eight predicted Arabidopsis miRNAs exhibited high sequence similarity to one or more known Arabidopsis miRNAs and could be grouped into five clusters (Figure ). We could not find convincing evidence that Arabidopsis and animal miRNAs are related, as clustering of these required the insertion of multiple gaps in the alignments (data not shown).
Putative mRNA targets of predicted Arabidopsis miRNAs
A previous study has predicted that most known plant miRNAs bind to the protein-coding region of their mRNA target with nearly perfect sequence complementarity, and degrade the target mRNA in a way similar to RNA interference (RNAi) [
29]. Analysis of several targets has now confirmed this prediction, making it feasible to identify plant miRNA targets [
12,
15,
16]. We developed a computational method based on the Smith-Waterman nucleotide-alignment algorithm to predict mRNA targets for the 83 newly identified miRNA candidates reported in this paper (see Materials and methods for details). Focusing on miRNA complementary sites that were conserved in both
Arabidopsis and
O. sativa, our method was able to identify 94% of previously confirmed or predicted mRNA targets for known conserved
Arabidopsis miRNAs. Applying the method to the 83 predicted
Arabidopsis miRNA candidates and their
O. sativa orthologs, we predicted 371 conserved mRNA targets for 77 predicted
Arabidopsis miRNAs, with an average of 4.8 targets per miRNA. The signal-to-noise ratio of the miRNA targets prediction was 4.1:1 when using randomly permuted sequences with the same nucleotide composition to miRNA sequences as negative controls that went through the same target prediction process. A complete list of these predicted target mRNAs and their pairings with miRNA sequences is available in Additional data file 4.
Of the 371 predicted miRNA targets, 10 were potential targets of two independent miRNAs, one (At3g54460 mRNA) was a potential target of three different miRNAs (At1g60020_5_14, At3g27883_1009, At5g62160_613_rc), and the rest were targets of a single miRNA. We assessed the biological functions of all predicted miRNA targets using gene ontology (GO) [
45]. GO terms for 254 targets were found in the molecular function class. Molecular functions of the putative miRNA targets included transcription regulator activity, catalytic activity, nucleic acid binding, and so on, as summarized in Table . As some proteins were classified in more than one molecular function category, the total number of targets listed in different function categories in Table exceeds the number of targets with GO function assignment.
| Table 2Analysis of predicted miRNA target functions using GO annotation |
Consistent with previous reports [
29], a large proportion of predicted targets encoded proteins with transcription regulatory activity, corresponding to 50% of total targets with GO annotation (129/254). One interesting phenomenon was that most transcription regulators in the miRNA target set were plant specific, such as MYB, AP2, NAC, GRAS, SBP and WRKY family transcription factors (Table ). For example, the miRNA target set included 10 plant specific NAC-domain-containing transcription factors, corresponding to 9% of total NAC-domain-containing transcription factors encoded by the
A. thaliana genome. In contrast, 139 genes encoding a general transcription factor bHLH were found in the
A. thaliana genome, but only three were putative miRNA targets.
| Table 3Family specificity of putative miRNA-targeted transcription factors |
We analyzed the expression patterns of potential targets to look for indications that they were under miRNA regulation. Twelve of the 14 miRNAs confirmed by northern blot hybridization showed an increased accumulation in flower tissue compared to the other tissues tested (Figure ), suggesting a role for miRNAs in regulating flower-specific events. In a search of
Arabidopsis microarray gene expression data available from The
Arabidopsis Information Resource (TAIR) [
46], we found the expression profile for 11 predicted mRNA targets that can base-pair nearly perfectly with five confirmed flower-abundant miRNAs. We hypothesized that expression levels of these targets in flower tissue could be decreased as compared to whole plant RNA samples as a result of mRNA cleavage induced by miRNA regulation. Accordingly, a reduced expression level (more than 1.25-fold decrement) was found for eight genes in total flower mRNA compared to total whole plant mRNA, with another three whose expression was almost unchanged (Table ). A
t-test on the possibility of decreased expression between transcripts listed in Table and in the entire microarray data resulted in a
p-value of 0.04, indicating that the decreased expression observed for predicted miRNA targets is significantly different from the general expression pattern of the entire microarray data.
| Table 4Flower microarray expression data for putative targets of miRNAs identified by northern blot hybridization |
Target mRNA fragments resulting from miRNA-guided cleavage are characterized by having a 5' phosphate group, and cleavage occurs near the middle of the base-pairing interaction region with the miRNA molecule. Using a modified RNA ligase-mediated 5' rapid amplification of cDNA ends (5' RACE) protocol, we were able to detect and clone the At3g26810 mRNA fragment corresponding precisely to the predicted product of miRNA processing (Figure ). Two other genes, At3g62980 (TIR1) and At1g12820, share extensive sequence homology with At3g26810 and were also predicted to be targets of miR393a. Consistent with this, we also identified the corresponding RNA fragments derived from miRNA cleavage by 5' RACE (data not shown). We were not able to identify other targets from flower RNA samples using a similar approach. The microarray data used in this tissue comparison experiment includes around 7,400 genes only (about a quarter of the entire Arabidopsis genome). Thus, we expect the expression profile of more mRNA targets to be determined as more whole-genome tissue comparison data is available.