The clustering propensity of miRNAs and the enrichment of new miRNAs in the vicinity of known miRNA genes have been suggested previously (8, 9, 12, 14, 30, 34), but have not been comprehensively studied in human. In this study, we systematically evaluated the clustering of miRNA genes in the human genome, exploring different aspects of miRNA clustering. We applied both computational and experimental approaches to rigorously define the clusters on the one hand, and to search for new miRNAs in the vicinity of known ones on the other hand. In using the clustering property for predicting new miRNAs in the vicinity of already known ones, we used the same clustering criteria that were used in the analysis of the known miRNAs.
A pairwise distance analysis of same-strand adjacent miRNAs shows that the distances between the known miRNA genes are smaller than expected at random. For very small distances (up to 100 nt), the fraction of clustered miRNA genes even exceeds that of exons, probably because miRNA genes are very small. The comparison with other RNA genes is especially interesting: miRNAs are more clustered than all other non-coding RNAs at very short pairwise distances, but other RNA types, such as snoRNAs and tRNAs, exceed them in other distance ranges.
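The pairwise distance analysis described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the tuple layout, the function names and the definition of distance as the gap between the end of one gene and the start of the next are assumptions made for illustration, and the coordinates in the usage note are invented.

```python
from collections import defaultdict


def adjacent_distances(genes):
    """Distances between adjacent genes on the same chromosome and strand.

    genes: iterable of (name, chrom, strand, start, end) with 1-based,
    inclusive coordinates (an assumed layout, not taken from the study).
    Returns a list of (name_a, name_b, distance_in_nt).
    """
    by_key = defaultdict(list)
    for name, chrom, strand, start, end in genes:
        by_key[(chrom, strand)].append((start, end, name))

    pairs = []
    for key in sorted(by_key):
        items = sorted(by_key[key])  # order genes by start coordinate
        for (s1, e1, n1), (s2, e2, n2) in zip(items, items[1:]):
            # Assumed definition: number of nucleotides between the two
            # genes; overlapping genes are given a distance of 0.
            pairs.append((n1, n2, max(0, s2 - e1 - 1)))
    return pairs


def clustered_fraction(genes, max_dist):
    """Fraction of genes with a same-strand neighbor within max_dist nt."""
    genes = list(genes)
    clustered = set()
    for name_a, name_b, dist in adjacent_distances(genes):
        if dist <= max_dist:
            clustered.update((name_a, name_b))
    return len(clustered) / len(genes) if genes else 0.0
```

With a toy list such as `[("a", "chr1", "+", 100, 180), ("b", "chr1", "+", 1000, 1080), ("c", "chr1", "-", 1200, 1280)]`, calling `clustered_fraction(genes, 3000)` counts `a` and `b` as clustered and leaves `c` out, because it lies on the opposite strand. Repeating the call with several values of `max_dist` shows how the clustered fraction depends on the chosen threshold.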
By comparing the frequency of miRNA-like sequences in the vicinity of previously known miRNA genes to their frequency in random regions, we were able to demonstrate that there is a significant enrichment in miRNA-like sequences in the flanking regions of NMH miRNAs. Interestingly, the frequency of miRNA-like conserved stem–loops near MH miRNAs, although statistically significantly higher than in random sequences, seems to be less substantial. We suggest that this difference between the NMH and MH sets results from the different cluster definitions we have used. Unlike the NMH miRNA set, where the search region was limited to at most 3000 nt on each side, the search region for MH miRNAs was defined to be their hosting intron or UTR, which could vary considerably in length. In fact, >40% of the MH regions exceed 10^4 nt. However, except for one MH miRNA, all the known MH miRNAs that were found to be clustered in this study had pairwise distances <3000 nt. This observation still holds for the two MH miRNA predictions that could be verified experimentally: miR-452, which resides 969 nt upstream of miR-224, and miR-425, which resides 381 nt downstream of miR-191. This implies that MH clustered miRNAs tend to be closely clustered even within their introns, while the rest of the intron does not show statistically significant enrichment with miRNA-like sequences. Thus, overall, we conclude that NMH and MH miRNAs have similar clustering distance thresholds. The preference for relatively small distances of clustered miRNAs within pre-mRNAs implies that miRNA clustering might be beneficial not only for shared transcription but also for other stages of miRNA processing, such as cleavage or transport.
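One way to compare the frequency of miRNA-like sequences in flanking regions with their frequency in random regions is a one-sided Fisher exact test on a 2×2 table of counts. The sketch below is only an illustration of that kind of comparison: the text above does not state which statistical test was used, so the choice of test, the function name and its parameters are all assumptions.

```python
from math import comb


def enrichment_p_value(hits_near, total_near, hits_random, total_random):
    """One-sided Fisher exact test (an assumed choice of test).

    Returns the probability of seeing at least `hits_near` regions with a
    miRNA-like sequence among the `total_near` flanking regions, given the
    overall number of hits in flanking and random regions combined.
    """
    population = total_near + total_random  # all regions examined
    successes = hits_near + hits_random     # regions containing a hit
    draws = total_near                      # regions that flank a miRNA
    denominator = comb(population, draws)
    upper = min(successes, draws)
    # Hypergeometric tail: P(X >= hits_near). math.comb returns 0 when
    # the second argument exceeds the first, so impossible terms vanish.
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(hits_near, upper + 1)
    )
    return tail / denominator
```

A small p-value indicates that hits are over-represented in the flanking regions relative to the random regions. The counts passed to the function would come from the region scans themselves; no values from this study are assumed here.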
The analysis scheme that we applied for the study of the miRNA-like properties shares many of its features and methodologies with previously described methods for the prediction of new miRNA genes [reviewed in (5)]. Still, there are several interesting insights that may have implications for the improvement of future algorithms, especially regarding the conservation pattern. As miRNAs are highly conserved across different organisms, all previously reported algorithms for the prediction of animal miRNAs relied on this trait (8, 29, 30, 32–36). Most methods emphasize the importance of a multi-organism comparison [e.g. (29, 30, 32, 33)]. However, this was usually performed by merging the results of several pairwise comparisons. In this study, we used the UCSC phastCons conservation scores (22, 23) (http://genome.ucsc.edu), which were based on a multiple alignment of five organisms (human, mouse, rat, chimpanzee and chicken). Since these scores were calculated for each position along the human genome, we could easily derive the conservation patterns of known miRNAs and their proximal regions. Interestingly, miRNAs seem to have a typical conservation pattern which is mainly characterized by its relatively short width and high peak. We have also noticed that there is a symmetric saddle-like pattern, which stems from the more divergent nature of the loop in the structure. This type of pattern was previously observed in a pairwise alignment of two species of Drosophila (29), and very recently in a multiple alignment of 10 primates (30). In both cases, however, the compared organisms were closely related. We believe that the derivation of the pattern properties from multiple alignments of distant organisms filters out other conserved regions around the miRNA, which may mask the typical conservation patterns. Indeed, the conservation patterns demonstrated in our study are more pronounced than those observed in the alignments of the 10 primates (30).
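The saddle-like conservation pattern described above, two highly conserved arms separated by a less conserved loop, can be searched for in a per-position score track with a minimal sketch like the following. The threshold, minimum arm length and maximum gap are hypothetical values chosen for illustration; they are not the criteria used in this study, and the function names are likewise assumptions.

```python
def high_runs(scores, threshold):
    """Return half-open (start, end) runs where score >= threshold."""
    runs, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    return runs


def find_saddles(scores, threshold=0.9, min_arm=15, max_gap=20):
    """Find pairs of conserved arms separated by a short trough.

    scores: per-position conservation scores in [0, 1].
    threshold, min_arm, max_gap: hypothetical illustration values.
    Returns a list of (start, end, gap) with half-open coordinates.
    """
    # Keep only conserved runs long enough to count as an arm.
    arms = [(s, e) for s, e in high_runs(scores, threshold) if e - s >= min_arm]
    saddles = []
    for (s1, e1), (s2, e2) in zip(arms, arms[1:]):
        gap = s2 - e1  # width of the less conserved middle region
        if gap <= max_gap:
            saddles.append((s1, e2, gap))
    return saddles
```

For a toy track of 20 high-scoring positions, 10 low-scoring positions and another 20 high-scoring positions, `find_saddles` reports one candidate whose gap is 10. Lowering `max_gap` below the width of the middle region rejects it, which is how a pattern with a longer trough would be missed by a strict gap limit.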
Our analysis demonstrated that explicit incorporation of the pattern conservation property as a miRNA-like attribute is very powerful in filtering out random intergenic sequences. This is supported by the findings of Berezikov et al. (30). Interestingly, the known miRNAs that we failed to identify by our computational procedure were missed because they did not adhere to the conservation criterion. Among these, half had a longer ‘middle’ gap than we allowed, suggesting that allowing a longer ‘trough’ in the conservation pattern may improve the predictions.
In this study, we focused on the characterization of clusters of same-strand miRNA genes that are relatively close (or reside within the same genomic unit), using high resolution conservation scores derived from multiple-organism alignments, and explicitly including the conservation pattern attributes in the filtering process. Our results show that the extracted miRNA-like features are highly successful in identifying the miRNAs and in filtering noise. We were able to identify 100% of the MH miRNA genes, which were excluded from the training set and kept as a test set. When applied to identification of miRNA sequences in the vicinity of previously known miRNA genes, selecting among the conserved regions the ones with folding potential reduced the predictions by >50%, and application of the additional criteria narrowed down these remaining sequences by an additional 50%. All in all, the various filters narrowed down the predictions in the vicinity of miRNA genes by ~80%, resulting in 97 predictions, 18 of which could be supported either by cloning results or by sequence similarity. The 18 miRNA predictions were all within 3000 nt of their nearest clustered member, raising to 42% the proportion of clustered human miRNA genes with pairwise chromosomal distances of at most 3000 nt. The determination of 3000 nt as the distance threshold may lead to an underestimation of the number of miRNA clusters in human. In a recent study, Baskerville and Bartel (18) demonstrated that proximal pairs of miRNAs tend to coexpress, and that the correlation in expression dropped when the distance between the miRNA pairs exceeded 50 kb. They suggested that clustered miRNAs are expected to be found within this range, which is one order of magnitude larger than our threshold. Indeed, if we set the distance threshold to 10 000 nt, the fraction of clustered miRNA genes rises to 48%. Also, expressed sequence tag evidence indicates that distant miRNAs may reside on the same transcript (e.g. miR-100, let-7 and miR-125). Thus, by adding transcript considerations to the definition of clusters, the number of clusters may further increase. Still, with the stringent definitions used here we demonstrate a strong phenomenon of miRNA clustering.
The validated miRNAs encoded in the vicinity of previously known ones, together with neighboring sequences with miRNA-like properties, revealed new miRNA clusters and increased the number of known members of some of the previously identified clusters. The polycistronic organization of miRNA genes may have important implications for the evolution of miRNA sequences.