|Home | About | Journals | Submit | Contact Us | Français|
The recent discoveries of microRNA (miRNA) genes and characterization of the first few target genes regulated by miRNAs in Caenorhabditis elegans and Drosophila melanogaster have set the stage for elucidation of a novel network of regulatory control. We present a computational method for whole-genome prediction of miRNA target genes. The method is validated using known examples. For each miRNA, target genes are selected on the basis of three properties: sequence complementarity using a position-weighted local alignment algorithm, free energies of RNA-RNA duplexes, and conservation of target sites in related genomes. Application to the D. melanogaster, Drosophila pseudoobscura and Anopheles gambiae genomes identifies several hundred target genes potentially regulated by one or more known miRNAs.
These potential targets are rich in genes that are expressed at specific developmental stages and that are involved in cell fate specification, morphogenesis and the coordination of developmental processes, as well as genes that are active in the mature nervous system. High-ranking target genes are enriched in transcription factors two-fold and include genes already known to be under translational regulation. Our results reaffirm the thesis that miRNAs have an important role in establishing the complex spatial and temporal patterns of gene activity necessary for the orderly progression of development and suggest additional roles in the function of the mature organism. In addition the results point the way to directed experiments to determine miRNA functions.
The emerging combinatorics of miRNA target sites in the 3' untranslated regions of messenger RNAs are reminiscent of transcriptional regulation in promoter regions of DNA, with both one-to-many and many-to-one relationships between regulator and target. Typically, more than one miRNA regulates one message, indicative of cooperative translational control. Conversely, one miRNA may have several target genes, reflecting target multiplicity. As a guide to focused experiments, we provide detailed online information about likely target genes and binding sites in their untranslated regions, organized by miRNA or by gene and ranked by likelihood of match. The target prediction algorithm is freely available and can be applied to whole genome sequences using identified miRNA sequences.
MicroRNAs (miRNAs) are a novel class of gene products that repress mRNA translation or mediate mRNA degradation in a sequence-specific manner in animals and plants [1-4]. To date, several hundred different miRNAs have been identified from various organisms and their sequences are archived and accessible at the RFAM miRNA registry website [5,6]. Currently, this database contains 21 miRNAs from Arabidopsis thaliana, 48 from Caenorhabditis briggsae, 106 from C. elegans, 73 from D. melanogaster, 122 from Mus musculus, and 130 from Homo sapiens.
With few exceptions, the target genes and the mechanism of target suppression are currently unknown because reliable experimental methods for comprehensively identifying the miRNA targets have yet to be developed. Founding members of the miRNA family, lin-4 and let-7 in C. elegans, have a central role as key regulators of developmental timing through cell fate decisions [7,8]. Because these miRNA genes are also conserved in other animals and mammals [9,10], it is not surprising to find that homologous genes, which were initially identified by genetic interaction, also possess conserved miRNA binding sites . In insects, the bantam miRNA has been found to regulate cell proliferation and cell death by targeting the apoptosis gene hid (wrinkled) . D. melanogaster miR-14 has been implicated in fat metabolism and stress resistance as well as cell death, however the precise target genes of this miRNA have not been identified . The identification of animal miRNA targets is computationally difficult because animal miRNAs are relatively short and are only partially complementary to their mRNA targets, possibly because of additional interactions involving RNA binding proteins. As a result, it is challenging to define an algorithm and thresholds to predict reliably such target sites.
In contrast to animal miRNAs, some plant miRNA targets are more readily identified because of near-perfect complementarity to their target sequence . Many of these plant mRNA targets encode transcription factors that regulate morphogenesis [15-19]. As a consequence of near-perfect complementarity, plant miRNAs predominantly act as small interfering RNAs (siRNAs) guiding destruction of their mRNA target, though some have also been found to behave like animal miRNAs. In addition, up to this point, plant miRNA target sites are predominantly found within the protein-coding segment of the target mRNAs , while animal miRNAs appear to primarily target the 3' untranslated region (UTR) [4,12,20-25].
The miRNA and siRNA pathways overlap at several points. Both siRNAs and miRNAs are processed from double-stranded RNA precursors requiring dsRNA-specific RNase III enzymes [26-30]. By an unknown molecular mechanism, the excised small RNAs become associated with Argonaute member proteins to form an RNA-induced silencing complex (RISC) that is able to target near-perfect complementary RNAs for degradation or for the control of translation [31-34]. In animal systems, it was shown that the introduction of a certain number of mismatches at centrally located positions allows for a switch from targeted mRNA degradation to translational repression [35,36]. In general, however, mutations in siRNAs typically abolish gene silencing without switching to translational repression . Intriguingly though, siRNA-guided cleavage activity can be detected with sometimes distantly-related complementary sequences when siRNA specificity is evaluated at a genome-wide level .
About 10% of the miRNAs identified in invertebrates are also conserved in mammals, indicating that the regulatory function of these genes is likely to be conserved cross-species. Since miRNA-containing species have been separated by hundreds of millions of years of evolution, it is striking that many 22 nucleotide miRNAs do not exhibit stronger sequence divergence. This absence of sequence-evolution in many miRNAs suggests that these miRNAs have more than one target site and that evolution by compensatory base-pair changes has become extremely unlikely. Therefore, a miRNA may regulate few or many genes depending on its apparent birth date. It is also conceivable that additional evolutionary constraints, such as the presence of certain protein-binding sites within the miRNA-targeted mRNAs, are conferring specificity to these small RNA regulated processes.
In order to address the question of miRNA target identification in animals, we have developed a computational method to detect miRNA targets. This approach ranks the likelihood of each gene to be a miRNA target and conversely for each miRNA to target a gene (Figure (Figure1).1). The target prediction method relies on the maintenance of evolutionary relationships between miRNAs and their targets, using three completely sequenced insect genomes (D. melanogaster, D. pseudoobscura and A. gambiae). We identify distinct networks of miRNA-mediated gene regulation such as the control of cell fate, morphogenesis and nervous system function, which appear to be preferentially targeted by miRNAs .
Every gene in D. melanogaster is a potential target for one or more of the characterized D. melanogaster miRNAs [40-43]. Reliable identification of miRNA targets is qualitatively different from standard sequence similarity analysis and requires new methods. In traditional sequence analysis, one tries to assess the likelihood of a hypothesis, for example, whether similarity between two sequences is due to common ancestry and continuity of functional constraints or a chance occurrence. Here, however, we aim to assess the likelihood of an actual physical interaction between two molecular species with phenotypic consequences, and have to assume that the interacting molecular species are present in the cell at the same time and at sufficient effective concentrations to facilitate the interaction. One of these species is a mature miRNA, the other a full length and translation-competent mRNA, or, in this study, its 3' UTR. The analysis is complicated by the fact that the sequence of the miRNA is small (approximately 22 nucleotide) and that the interaction may be substantially affected by a protein complex, that is, the interaction is probably not simple hybridization by optimal base pairing. To address these difficulties, at least in part, we have developed a three-phase method (miRanda) for target site identification from sequence information (Figure (Figure1).1). The three phases are as follows: sequence-matching to assess first whether two sequences are complementary and possibly bind; free energy calculation (thermodynamics) to estimate the energetics of this physical interaction; and evolutionary conservation as an informational filter. We have validated this method using experimentally verified target genes (and target sites) from the literature [4,20-25] and against a randomized background model (see Materials and methods).
Using each of the 73 available D. melanogaster miRNAs as probes, we scan the 3' UTRs of 9,805 D. melanogaster genes for possible complementarity matches using a dynamic programming algorithm. For the remaining genes of D. melanogaster, accurate 3' UTR sequences were not available. The algorithm takes into account G-U wobble pairs, allows moderate insertions and deletions and, importantly, uses a weighting scheme that rewards complementarity at the 5' end of the miRNA, as observed in known miRNA:target-mRNA duplexes. In addition, we have applied position-specific empirically defined rules (see Materials and methods). The result is a score (S) for each detected complementarity match between a miRNA and a potential target gene.
For each match, the free energy (ΔG) of optimal strand-strand interaction between miRNA and UTR is calculated using the Vienna package . We cannot, however, take into account the energy of interaction with possible protein components, as these details are at present largely unknown .
Given imperfect rules for sequence pairing and energy estimation, the conservation of predicted miRNA-target pairs in closely-related species is an important additional criterion for this analysis. Given the surprisingly high level of sequence conservation of miRNAs across phyla [9,45], we assume that the set of miRNAs in D. melanogaster is shared identically with D. pseudoobscura and A. gambiae. We only consider a miRNA target pair to be conserved across species if the following three criteria are met: a specific miRNA independently matches orthologous UTRs in both species; sequences of detected target sites in both species exhibit more than a specified threshold of nucleotide identity (ID) with each other; and the positions of both target sites are equivalent according to a cross-species UTR alignment  (see Materials and methods).
For this three-phase assessment of miRNA-target matches, we use cut-off values that provide a balance between false positives and false negatives, by inspection of known targets (see Materials and methods). The thresholds for sequence conservation (≥80% identity for D. pseudoobscura and ≥60% identity for A. gambiae) were chosen after extensive analysis of alignments between D. melanogaster, D. pseudoobscura and A. gambiae 3' UTR sequences. To maximize predictive power, we have kept the number of adjustable parameters and cutoffs as small as possible (see Materials and methods).
We take into account potential many-to-one relationships between miRNAs and their target genes using an additive scoring scheme. This system allocates a score to a miRNA (or multiple miRNAs) and target gene by summing over all scores for all conserved target sites detected for that pair (see Materials and methods). All predictions are then sorted and ranked according to this scheme. This means that miRNA target predictions with high scores or multiple detected sites, or both, are ranked preferentially. Finally, for each miRNA, the ten highest scoring target genes are selected for further analysis. These represent our highest quality target predictions for each miRNA, and, as such, are suitable for experimental validation, although lower-ranked predictions are also available (see Additional data files).
Application of our target prediction method to experimentally verified targets serves as initial validation. However, as the method was developed using known target sites as a guide [4,12,20-25], independent validation and refinement of the method will depend on future experiments.
The method correctly identifies nine of the ten currently characterized target genes (Table (Table1)1) for the miRNAs lin-4 and let-7 in C. elegans and bantam in D. melanogaster. At this threshold level, the details (position and base pairing as reported by others) of most target sites are largely reproduced, together with interesting alternative target sites on the known target genes. This comparison is only partially conclusive, as not all reported target sites on known target genes have been individually verified by experiment. The missed duplex between the lin-4 miRNA and its reported target gene (lin-14) (Table (Table1)1) contains an unusually long loop structure in the target sequence, which cannot easily be detected without adversely affecting the rate of false positive detection. In conclusion, we not only detect the majority of known miRNA targets, but the rankings obtained from our additive scoring scheme for these targets are also consistently high (Table (Table1).1). For example, the two target genes of the let-7 miRNA (hbl-1 and lin-41) are detected as the number 1 and number 2 ranked genes hit respectively, from a scan against 1,014 C. elegans 3' UTR sequences.
We also perform control calculations by running multiple trials using randomized miRNA sequences (see Materials and methods). A simple estimate of the rate of false positive results from applying the same fixed match, energy and conservation thresholds to target sites using actual and randomized miRNA sequences and assuming that all above-threshold target sites using random sequences are not biologically meaningful. The estimated rate of false positives is (R-/R+), where R- is the total number of above-threshold hits for randomized miRNA sequences and R+ is the total number of above-threshold hits for the actual miRNAs. The overall estimate is a false positive rate of 35% (Table (Table2).2). Interestingly, target genes for actual miRNAs with two or more conserved sites occur 11 times more frequently than for randomized miRNAs, representing a much lower false positive rate of 9% (Table (Table2).2). This apparent increase in reliability for predictions with multiple sites per target gene may be related to cooperativity of miRNA-target interactions.
Finally, we test whether specific functional classes of miRNA target genes are highly over-represented among the predicted miRNAs compared to random expectation (Table (Table3).3). For each GO molecular function class  we count the number of conserved targets detected for actual miRNAs and for the sequence-randomized miRNAs. Statistical significance of over-represented classes is measured by a Z-score (see Materials and methods). These results indicate that some functional classes (for example, translational regulators and apoptosis regulators) are significantly enhanced among miRNA target gene predictions.
We report 535 potential target genes for the 73 known D. melanogaster miRNAs in decreasing order of match score for sites in detected 3' UTR targets. All results are available online . All of these targets have passed the filters for free energy estimates, and conservation of target site sequence and position between D. melanogaster and D. pseudoobscura (see Materials and methods).
Of these predicted target genes in D. melanogaster, 264 have some functional annotation [48,49], while 231 have more than one predicted target interaction site. Results obtained from our random model suggest that 3' UTRs with more than one predicted target site for a given miRNA are more reliable than those with a single site. For this reason, the most promising candidates for target validation experiments may be the 231 target genes with multiple miRNA target sites. In general, cooperative binding of miRNAs to a single target gene can involve multiple hits by one or several distinct miRNAs. Specific examples are: The eye pigmentation gene brown (bw) is hit by the miRNAs bantam (three sites) and miR-314 (two sites); the apoptosis gene hid/wrinkled (W) is hit by bantam (two sites), miR-309 and miR-286; and the eye development gene seven-up (svp) is hit by miR-33 (two sites), miR-124, miR-277 and miR-312.
Given the strong conservation of miRNA sequences between D. melanogaster and A. gambiae , we searched for a subset of the above targets also conserved in the 3' UTRs of A. gambiae genes using the same procedure (Figure (Figure1).1). In this way, we find 150 potential targets in corresponding A. gambiae genes. Of these, some 40% exhibit target site conservation of more than 60% identity to the corresponding D. melanogaster site. Notable examples are Scr (miR-10), netrin-B (miR-184, miR-284), sticks and stones (miR-282; two hits), and VACht (miR-9) (notation is: target gene (miRNA)).
Given the availability of an essentially complete set of protein coding genes and (perhaps) nearly all miRNAs in D. melanogaster, one can identify both biological processes and molecular functions predicted to be preferentially targeted by miRNAs. Processes we find to be over-represented by detected miRNA targets include: transcriptional control, translational control, cell-adhesion, enzyme regulation and apoptosis regulation (Figure (Figure2).2). Separate groups of miRNAs appear to be specific to particular functional classes of target genes: for example, a group of seven miRNAs, miR-281, miR-311, miR-79, miR-92, miR-305, miR-131 and miR-31a, enriched two to four times in target genes in larval development; a group of five, bantam, miR-286, miR-309, miR-14 and miR-306, enriched three to six times in targets implicated in death or cell death; and the group bantam, miR-286 and the miR-2/miR-13 family, enriched five to six times in genes involved in regulation of apoptosis. A group of nine miRNAs is also two to three times enriched in genes involved in pattern specification. Overall, target genes annotated as transcription factors are detected twice as frequently as expected by chance (21% of annotated identified target genes, compared to 9.5% of all annotated FlyBase  genes). Translation factors are increased four times over expectation (miR-318, miR-304, miR-276b). This may represent a more general control mechanism for miRNA regulation of translation, in addition to the specific control of translation of individual target genes.
Investigating possible connections between genomic location and function, we analyzed 12 clusters of miRNAs in the D. melanogaster genome the members of which are potentially co-expressed, for example, let-7, miR-125 and miR-100 [40-43]. Contrary to expectation, we did not find any obvious links between genomic location and predicted target gene. One possible exception is the link between the position of three of the five copies of the miR-2 family in the intron of the gene spitz (involved in growth) and one of its top target genes, reaper (involved in cell death).
The well-characterized miRNA let-7 (in C. elegans) has two annotated top ten targets: tamo and lar. The gene tamo is thought to be required for the nuclear import of the NF-κB homolog Dorsal and recent work has connected it to the expression of a small RNA regulated by ecdysone [51,52]. We do not predict hunchback as a target of let-7 in D. melanogaster, given the thresholds and parameters used, although some below-threshold hits do appear to be conserved in D. pseudoobscura. Instead, we predict miR-12 and miR-184 to have a stronger effect on hunchback.
Many genes involved in the maternal genetic system, determining germ cell fate and anterior-posterior polarity of the egg, are well known to be translationally regulated. We clearly predict the following subset of these genes to be miRNA targets: germ cell less, bicoid, hunchback, caudal, staufen, arrest (bruno-1) and bruno-2. In addition, although the genes oskar and nanos are not top-ten ranked predictions, oskar has two conserved target sites (miR-3, miR-6), ranked below the top-ten hits for these miRNAs, and nanos has five strongly predicted target sites for miR-9c (three) and miR-263b (two), but below the 80% conservation threshold. Taken together, these data may indicate that this system is at least partially under miRNA regulation. We detail below three more biological processes predicted to be subject to miRNA regulation.
A multi-tiered hierarchy of transcription factors establishes the morphological segmentation and diversification of the anterior-posterior body axis of the Drosophila embryo . The Hox genes (lab, pb, Dfd, Scr, Antp, Ubx, abd-A and Abd-B) play a key role in diversification by switching the fates of embryonic segments between alternative developmental pathways . The genes are organized in two separate clusters on chromosome 3R, the Antennapedia (lab, pb, Dfd, Scr, Antp) and Bithorax (Ubx, abd-A, Abd-B) complexes. Both the genes and their relative order within the complexes are conserved in vertebrates .
Our predictions indicate that five of the eight Hox genes are regulated by miRNAs (Table (Table4).4). The 3' UTR of Scr is a potential target of miR-10, which is located within the Antennapedia complex between Dfd and Scr (and similarly, near the homologs of Dfd (hox4) in the Hox gene clusters of A. gambiae, Tribolium castaneum, zebrafish, pufferfish, mouse and human ). Scr is also a strong target for bantam, the miRNA associated with the apoptosis gene hid , and for miR-125, the putative Drosophila homolog of the miRNA lin-4 in C. elegans. Another strong target match for miR-125 is ftz, which is involved in the regulation of Hox genes and lies within the Hox cluster between Scr and Antp. All three of the Bithorax complex genes are likely to be regulated by multiple miRNAs. Interestingly, abd-A and Abd-B are both targeted by miR-iab-4-3-p, which is located within the complex between abd-A and Abd-B.
Aside from the Hox genes themselves, several other regulators of Hox gene function also appear to be miRNA targets. These include members of the trithorax activator (trx, trr) and the Polycomb (Pc) repressor groups, which control the spatial patterns of Hox gene expression by maintaining chromatin structure , and homothorax (hth), which is required for the nuclear translocation of the Hox cofactor extradenticle .
Ecdysone signaling triggers and coordinates many of the developmental transitions in the life cycle of Drosophila. Ecdysone pulses occur during embryonic, the three larval instar, prepupal, pupal and adult stages and regulate numerous physiological processes including morphogenetic cell shape changes, differentiation and death [59-62]. The regulation of these diverse processes by ecdysone is achieved through a complex genetic hierarchy. At the top of the hierarchy is the ecdysone receptor (EcR), a member of the nuclear hormone receptor family; it regulates the expression of different sets of transcription factors, including the zinc finger proteins of the Broad-Complex and many other nuclear hormone receptors, which in turn control key regulators of the different physiological processes.
Our predictions show many potential miRNA targets at several levels of the ecdysone cascade. These include EcR and several of the downstream transcription factors and co-factors (for example, br, E71, E74, E93, crol, fkh) [63-66]. Interestingly, broad (br), whose expression is exquisitely timed and differentially controlled in different tissues, has seven alternate splice forms with five different 3' UTRs. All five 3' UTRs contain high-ranking predicted targets for miRNA regulation (Table (Table5).5). The fact that three miRNAs control different splice forms in varying combinations supports the analogy made to transcriptional regulation for describing the combinatorial mechanisms to achieve specificity and redundancy in targeting genes.
In addition to the core transcription factors of the ecdysone cascade, several of its effector pathways are likely to be directly targeted by miRNAs. These include genes in morphogenetic/stress signaling (aop, msn, slpr, hep), biogenesis (rab6) and the cell death pathway (hid, rpr, parcas, Rep2) [67-69]. They also include several miRNAs target genes that are involved in the biosynthesis of ecdysone (woc, CypP450s) and of other hormones triggering developmental transitions (amon/ETH, Eh) [70,71]. Despite their synchronous expression with ecdysone pulses in late larvae and pre-pupae , let-7 and miR-125 are not prominently targeting the core factors of the ecdysone cascade.
We predict a large number of miRNA target genes that are involved in cell fate decisions in the developing nervous system. In particular, we predict several miRNA target genes within components of the Notch pathway, which regulates the early decision between the neuronal and the ectodermal fate [72,73]. These include Notch ligands and factors regulating their stability (Ser, neur), as well as factors that bind to Notch (dx) or modify its sensitivity to ligands (sca, gp150, fng). They also include genes of the E(spl) complex (CG8328, CG8346) and of the Brd complex (Tom) . Genes in these two complexes are known to share motifs for translational regulation in their 3' UTR (Bearded- and K-box), some of which have previously been predicted to be miRNA target sites . In addition, our predicted miRNA target genes include factors involved in the asymmetric cell division of neuroblasts (insc, par 6) and transcription factors regulating different aspects of neuronal differentiation (Dr, jim, Lyra, nerfin, SoxN, svp, unc4).
The establishment of neural connectivity is a complex morphogenetic process comprising the growth and guidance of axons and dendrites, and the formation of synapses. Many miRNAs target these processes at several different levels (Figure (Figure3;3; Table Table6).6). These targets include a remarkably large number of secreted and transmembrane factors known to mediate axon guidance decisions (netrin A and B, Slit; Drl, Dscam, Eph, PTPs, Robos, Semas; FasI, beatIV) [76-78] (Table (Table6).6). All these factors are conserved and have similar axon guidance functions in vertebrates. Interestingly, netrin1 and robo are also predicted miRNA target genes in vertebrates (our unpublished observations), suggesting that translational control is an important conserved aspect of the regulation of axon guidance factors.
In addition to these cell surface factors, miRNAs target the cellular machinery that effects cell shape change and adhesion, including regulators and components of the cytoskeleton (for example, dock, trio, RhoGAPs, Rho1, Abl, tricornered, wasp, Sop2, nesprin, Khc-73, gamma-tubulin) and of the cell junctions (for example, crumbs, dlt, mbc, skiff).
Many of the developmental factors are re-employed in the mature nervous system to control synaptic function by effecting morphogenetic changes in synapse size, shape or strength. Additional miRNA target genes that are active in the mature nervous system include neurotransmitter receptors, ion channels and pumps (clumsy, DopR, nAChr; Hk, Shaker), as well as factors involved in neurotransmitter transport and synaptic release (SerT, vAChT, Eaat1; Cirl, Rab3, Sap47, unc13) .
Why would translational regulation by miRNAs feature so prominently in the development and function of the nervous system? The distances between the nucleus and dendrites/axon projections are relatively large, making nuclear regulation difficult. Furthermore, differential gene activity between and even within compartments (for example, between different portions of a growth cone or different branches of a dendritic tree) is crucial for neuronal function. Therefore, translational control near the site of action is a more efficient means of modulating gene activity than transcriptional control. For axon guidance, it has been shown that the relative abundance of adhesion molecules and chemotropic receptors on the surface of the growth cone is post-transcriptionally regulated in response to external cues presented by intermediate targets. Regulation by miRNAs would thus provide an excellent additional mechanism of post-transcriptional control.
The precise rules and energetics for pairing between a miRNA and its mRNA target sites, with probable involvement of a protein complex, are not known and cannot easily be deduced from the few experimentally proven examples. Therefore, any computational methods for the identification of potential miRNA target sites are at risk of having a substantial rate of false positives and false negatives. Based on analysis of the known examples, we have biased our method toward stronger matches at the 5' end of the miRNA, and used energy calculation plus conservation of target site sequence to provide our current best estimate of biologically functional matches. Overall, we find that conservation is a crucial filter and reduces the rate of prediction error.
Our results suggest that miRNAs target the control of gene activity at multiple levels, specifically transcription, translation and protein degradation, in other words, that miRNAs act as meta-regulators of expression control. Among biological processes, we find that the most prominent targets include signal transduction and transcription control in cell fating and developmental timing decisions, as well as morphogenetic processes such as axon guidance. These processes share the need for the precise definition of boundaries of gene activity in space and time. Our findings therefore support and expand earlier work on the role of miRNAs in developmental processes [42,80]. In addition, we predict that miRNAs also play an important role in controlling gene activity in the mature nervous system.
As miRNA and mRNA have to be present simultaneously at minimum levels in the same cellular compartment for a biologically meaningful interaction, more precise expression data as a function, for example, of developmental stage , will be extremely useful and will be incorporated in future versions of target prediction methods. Similarly, further work will include the analysis of potential target sites in coding regions and 5' UTRs, as well as conservation and adaptation of target sites in many species.
This genome-wide scan for potential miRNA target genes gives us a first glimpse of the complexity of the emerging network of regulatory interactions involving small RNAs (see Additional data). Both multiplicity (one miRNA targets several genes) and cooperativity (one gene is targeted by several distinct miRNAs) appear to be general features for many miRNAs, as already apparent with the discovery of the targets for lin-4 and let-7. The analogy of these many-to-one and one-to-many relationships to those of transcription factors and promoter regions is tempting and elucidation of the network of regulation by miRNAs will make a major contribution to cellular systems biology. In the meantime, we would not be surprised if experiments focusing on target candidates filtered in this way have a high rate of success and help to unravel the biology of regulation by miRNA-mRNA interaction.
An initial set of D. melanogaster miRNA sequences was built using the RFAM miRNA database . Mature miRNA sequences were placed in a FASTA formatted sequence file. In total, the final file contained 73 unique miRNA sequences. All sequences used for this analysis are available from RFAM and as supplementary material .
Sequences for D. melanogaster 3' UTRs were obtained from the Berkeley Drosophila Genome Project (BDGP). In total, 3' UTR sequences were available for 14,287 transcripts, representing 9,805 individual D. melanogaster genes. A corresponding set of D. pseudoobscura 3' UTR sequences was then built from the March 2003 first freeze of the D. pseudoobscura genome project at Baylor College of Medicine. Each D. melanogaster 3' UTR was mapped to D. pseudoobscura contigs by searching both the actual D. melanogaster 3' UTR sequence, using NCBI BLASTn , and the peptide sequence of each gene, using NCBI tBLASTn , against D. pseudoobscura contigs . Results from these two scans were then used to identify candidate 2,000 bp regions of D. pseudoobscura contigs, within which we believe an orthologous D. pseudoobscura 3' UTR is present. The AVID  alignment tool was used to align the real D. melanogaster 3' UTR and a candidate D. pseudoobscura region. Finally, this alignment was used to trim each candidate region, leaving the predicted D. pseudoobscura 3' UTR. In total 12,416 transcripts and 8,282 genes from D. pseudobscura were mapped to orthologous D. melanogaster UTRs in this fashion. The Ensembl database  Application Programming Interface was used to construct A. gambiae predicted 3' UTRs by taking 2,000 bp downstream from the last exon of each transcript. Orthology mappings between A. gambiae and D. melanogaster UTRs were then obtained by searching all Ensembl A. gambiae peptides against all D. melanogaster peptides using BLASTp. In total 9,823 A. gambiae genes were mapped to D. melanogaster genes in this manner.
The miRanda algorithm is similar to the Smith-Waterman algorithm , however, instead of building alignments based on matching nucleotides (A-A or U-U, for example), it scores based on the complementarity of nucleotides (A=U or GC). The scoring matrix used for this analysis also allows G=U 'wobble' pairs, which are important for the accurate detection of RNA:RNA duplexes . Complementarity parameters at individual alignment positions are: +5 for GC, +5 for A=U, +2 for G=U and -3 for all other nucleotide pairs. The algorithm uses affine penalties (linear in the length of a gap after an initial opening penalty) for gap-opening (-8) and gap-extension (-2). In addition, following observation of known target sites, complementarity scores (positive and negative values) at the first eleven positions are multiplied by a scaling factor (here set at 2.0), so as to reflect the observed 5'-3' asymmetry. Finally, the following four empirical rules are applied, with positions counted starting at the 5' end of the miRNA: no mismatches at positions 2 to 4; fewer than five mismatches between positions 3-12; at least one mismatch between positions 9 and L-5 (where L is total alignment length); and fewer than two mismatches in the last five positions of the alignment. With these parameters, the dynamic programming algorithm optimizes the complementarity score between a miRNA sequence and an mRNA sequence (typically a 3' UTR), summed over all aligned positions, and finds all non-overlapping hybridization alignments in decreasing order of complementarity score down to some cutoff value (default value 80). The detection of sub-optimal alignments follows heuristics previously used in sequence alignment [81,83].
In order to estimate the thermodynamic properties of a predicted duplex, the algorithm uses folding routines from the Vienna 1.3 RNA secondary structure programming library (RNAlib) . The expanded thermodynamic parameters used  are more computationally intensive than the initial scan, but allow potential hybridization sites to be scored according their respective folding energies. The miRNA sequence and 3' UTR sequence from a hybridization alignment are joined into a single sequence with an eight base sequence linker containing artificial 'X' bases that cannot base pair. This strand-linker-strand configuration assumes the phase space entropy of strand-strand association is constant for all miRNA-target matches [44,83]. The minimum energy of this structure with the last matching base pair (from initial sequence alignment) constrained is then calculated using RNAlib.
All miRNA sequences are scanned against the 3' UTR datasets of D. melanogaster, D. pseudoobscura and A. gambiae. The thresholds used for hit detection are: initial Smith-Waterman hybridization alignments must have S ≥ 80, and the minimum energy of the duplex structure ΔG ≤ -14 kcal/mol. Each hit between a miRNA and a UTR sequence is then scored according to the total energy and total score of all hits between those two sequences. Hits are deemed to be conserved in D. pseudobscura or A. gambiae if a target site equivalent to that detected in a D. melanogaster UTR can be found in the orthologous D pseudoobscura or A. gambiae UTR at the same position in the UTR alignments. Our definition of equivalence between target sites is that their sequences are more than 80% identical for D. pseudoobscura and 60% identical for A. gambiae. All results from the scan are then ranked and sorted according to total score of conserved target sites detected. For each miRNA, the ten highest ranked genes are selected as its candidate target genes in this way. Multiple miRNAs binding the same site on a target gene are resolved using a greedy algorithm that assigns the highest scoring and lowest free energy miRNA target duplex to each potential site so that different miRNA target sites cannot overlap.
For the initial validation, 3' UTR sequences for C. elegans and C. briggsae were obtained, if possible, from UTRdb . If unavailable, UTR sequences were estimated by taking 2,000 bp of flanking nucleotide sequence downstream of the last exon of the gene in question using the Ensembl database .
Control sequences for the randomized experiment were constructed by assembling 100 sets of 73 miRNAs each generated by random shuffling of each D. melanogaster miRNA. Each of these sets of 73 randomized miRNAs was independently searched against all D. melanogaster and D. pseudoobscura 3' UTRs as in the reference experiment. Results and counts were then averaged over all 100 random sets, and were compared with the results of the actual miRNA scan. For the functional analysis, GO classes for known D. melanogaster genes were obtained from FlyBase and conserved hits for the real and random miRNAs for each class are counted. The Z-scores are generated from the actual miRNA counts, averaged random miRNA counts and their standard deviations.
The following files are available with this article and at : a supplementary target functional plot (GO biological process; Additional data file 1); a supplementary target:miRNA network graph (Additional data file 2); an Excel table of the top 10 miRNA target predictions by miRNA (Additional data file 3); and an Excel table of the top 10 miRNA target predictions by gene (Additional data file 4). Additional data files of top 20, 30, and 40 both by microRNA and by gene and the source code of the core miRanda algorithm are available at .
A supplementary target functional plot
A supplementary target:miRNA network graph
An Excel table of the top 10 miRNA target predictions by miRNA
An Excel table of the top 10 miRNA target predictions by gene
We thank Gary Bader, Boris Reva, Carl Thummel and Mihaela Zavolan for interesting discussions and also Rebecca Ward, Ulrich Unnerstall and Kerstin Dose for support and assistance.
A previous version of this manuscript was made available before peer review at http://genomebiology.com/2003/4/11/P8