Although the analysis of sequenced genomes to date has focused most heavily on the protein-coding set of genes, all genomes also contain a constellation of non-coding RNA genes. With the exception of certain classes of RNAs with strongly conserved sequences and/or structures, such as ribosomal and transfer RNAs, identification of most non-coding RNAs has historically been a relatively serendipitous affair. Only very recently have there been concerted efforts to identify such genes systematically, using both experimental and computational approaches [1].
Our collective ignorance of the totality of non-coding RNA genes was laid bare by recent work on microRNAs (miRNAs), an abundant family of 21-22 nucleotide non-coding RNAs [2,3]. The founding members of this family, lin-4 and let-7, were identified through forward genetic analysis of extant Caenorhabditis elegans mutants [4,5]. Both of these RNAs are post-transcriptional regulators of developmental timing that function by binding to the 3' untranslated regions (3' UTRs) of target genes [5-8]. Although they were long regarded as genetic curiosities possibly specific to nematodes, let-7 was subsequently found to be broadly conserved across bilaterian evolution [9], and miRNA genes are now recognized as a pervasive feature of animal and plant genomes [10-16].
In general, it is thought that miRNA biogenesis proceeds via intermediate precursor transcripts of more than 70 nucleotides that have the capacity to form an extended stem-loop structure (pre-miRNA), although at least some pre-miRNAs are further derived from even longer transcripts (primary miRNA transcripts, or pri-miRNAs). Pri-miRNAs can exist as long individual pre-miRNA precursor transcripts, as operon-like multiple pre-miRNA precursors, or even as part of primary mRNA transcripts. Processing of pri-miRNA into the pre-miRNA stem-loop occurs in the nucleus, while subsequent processing of pre-miRNA into 21-22-mers is a cytoplasmic event mediated by the RNase III enzyme Dicer [17-20]; Dicer is also responsible for cleavage of long, perfectly double-stranded RNA into 21-22 nucleotide fragments during RNA interference (RNAi) [2,21]. These latter molecules, known as small interfering RNAs (siRNAs), bind to and trigger the degradation of perfectly homologous mRNA molecules via RISC, a double-strand RNA-induced silencing complex containing nuclease activity [22,23].
Although the in vivo functions of only a few miRNAs are known so far, it is believed that the vast majority are likely to participate in post-transcriptional regulation of complementary mRNA targets. Interestingly, perfect or near-perfect target complementarity is associated with mRNA degradation [24-26], similar to the effects of siRNA, whereas imperfect base-pairing is associated with regulation by translational inhibition [6,27]. Recently, siRNAs with an imperfect match to their target mRNA were observed to function as translational inhibitors [28], suggesting that the type of 21-22 nucleotide RNA-mediated regulation may be largely determined by the quality of target complementarity.
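The relationship between complementarity and regulatory outcome described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the 21-nucleotide sequence is arbitrary, the duplex is treated as ungapped and scored with Watson-Crick pairs alone, and the 0.95 cutoff is a hypothetical threshold, not a value taken from the cited studies.

```python
# Toy model: classify the expected mode of regulation from the fraction of
# miRNA positions that pair with a target site. Illustrative only.

# Watson-Crick pairs; G-U wobble pairs are deliberately not counted here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the perfectly complementary site, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def paired_fraction(mirna: str, site: str) -> float:
    """Fraction of miRNA positions paired with the site (both given 5'->3')."""
    if len(mirna) != len(site):
        raise ValueError("this sketch assumes an ungapped, equal-length duplex")
    antiparallel_site = site.upper()[::-1]  # align the site 3'->5' under the miRNA
    paired = sum((m, t) in PAIRS for m, t in zip(mirna.upper(), antiparallel_site))
    return paired / len(mirna)


def predicted_mode(mirna: str, site: str, cutoff: float = 0.95) -> str:
    """Hypothetical rule: near-perfect pairing -> degradation, else inhibition."""
    if paired_fraction(mirna, site) >= cutoff:
        return "mRNA degradation"
    return "translational inhibition"


# Arbitrary 21-nt sequence used purely as an example.
MIRNA = "UGAGGUAGUAGGUUGUAUAGU"
PERFECT_SITE = reverse_complement(MIRNA)

# Introduce four central mismatches by copying the miRNA base opposite each
# position (identical bases never form a Watson-Crick pair).
n = len(MIRNA)
IMPERFECT_SITE = "".join(
    MIRNA[n - 1 - i] if 8 <= i <= 11 else base
    for i, base in enumerate(PERFECT_SITE)
)

print(predicted_mode(MIRNA, PERFECT_SITE))
print(predicted_mode(MIRNA, IMPERFECT_SITE))
```

Real target sites can contain bulges, wobble pairs and position-dependent effects, none of which this ungapped comparison captures; it only makes the "quality of complementarity" idea concrete.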
The vast majority of the approximately 300 miRNAs currently known were identified through direct cloning of short RNA molecules. Although this method has been quite successful thus far, its practicality is limited by the necessity for a considerable amount of RNA as raw material for cloning, and cloned products are often dominated by a few highly expressed miRNAs. For example, 41% of miRNAs cloned from HeLa cells are variants of let-7, 28% of human brain miRNAs are variants of miR-124, and 45% of miRNAs cloned from human heart and 32% of those cloned from early Drosophila embryos are miR-1 [10,29]. In fact, it has been opined that few additional mammalian miRNAs will be easily identified by the direct cloning method [30].
As a complementary approach to miRNA identification, we developed an informatic strategy ('miRseeker') and applied it to the completed genomes of Drosophila melanogaster and D. pseudoobscura, which are some 30 million years diverged. miRseeker subjects conserved intronic and intergenic sequences to an RNA folding and evaluation procedure to identify evolutionarily constrained hairpin structures with features characteristic of known miRNAs. The specificity of this computational procedure was shown by the presence of 18 out of 24 reference miRNAs within the top 124 candidates. We identified a total of 48 novel miRNA candidates whose existence was strongly supported by conservation in other insect, nematode or vertebrate genomes. Expression of 24 novel miRNA genes was verified by northern analysis (including 20 out of 27 candidates that were supported by third-species conservation and 4 out of 11 high-scoring predictions specific to Drosophila), demonstrating that the bioinformatic screen was successful. As might be expected, the newly verified miRNA genes vary tremendously with respect to abundance and developmental expression profile, suggesting diverse roles for these genes. Inference of our false-positive prediction and false-negative verification rates (based on our ability to identify known miRNAs and detect the expression of highly conserved, and thus presumed genuine, novel miRNAs) leads us to estimate that drosophilid genomes contain around 110 miRNA genes, or nearly 1% of the number of predicted protein-coding genes. In combination with other concurrent genomic analyses [31-34], it is likely that most miRNAs in completed animal genomes have now been identified. Collectively, this sets the stage for both genome-wide and targeted studies of this functionally elusive family of regulators.
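For readers who want the counts quoted above as rates, the following short sketch converts them to percentages. It uses only the numbers stated in this section; the dictionary keys and function name are illustrative labels, and the sketch does not attempt to reproduce the estimate of around 110 miRNA genes, whose full derivation is not given here.

```python
# Convert the counts reported in the text into percentages.
# (hits, total) pairs are taken directly from the paragraph above.
COUNTS = {
    "reference miRNAs recovered in top 124 candidates": (18, 24),
    "third-species-conserved candidates verified by northern": (20, 27),
    "Drosophila-specific high-scoring predictions verified": (4, 11),
}


def as_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


RATES = {label: as_percent(hits, total) for label, (hits, total) in COUNTS.items()}

# The two northern-verified groups together account for the verified genes.
TOTAL_VERIFIED = COUNTS["third-species-conserved candidates verified by northern"][0] \
    + COUNTS["Drosophila-specific high-scoring predictions verified"][0]

for label, rate in RATES.items():
    print(f"{label}: {rate}%")
print(f"total verified by northern analysis: {TOTAL_VERIFIED}")
```

The recovery of reference miRNAs and the verification of conserved candidates both come out near three in four, while Drosophila-specific predictions were verified at roughly half that rate.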