miRNAs are now estimated to regulate approximately 80% of genes
[1]. They act mainly in post-transcriptional regulation, through transcript disruption and translational blockade. miRNAs are found in intergenic, intronic and exonic regions
[2],
[3]. In animal systems, once transcribed from the genome either by RNA Polymerase II
[4] or by RNA Polymerase III
[5], the transcripts (primary miRNAs, pri-miRNAs) are recognized by the Drosha-DGCR8 microprocessor complex. This complex cleaves pri-miRNAs into precursor miRNAs (pre-miRNAs). Thereafter, Exportin transports these pre-miRNAs to the cytoplasm via the Ran-GTP transport pathway
[6]. In the cytoplasm, another RNase III enzyme, Dicer, cleaves the pre-miRNA into a mature miRNA duplex
[7].
The initial approaches for miRNA precursor discovery relied mainly on detecting the hairpin-shaped structure common to all pre-miRNAs. However, Bentwich
et al. estimated that there are approximately 11 million hairpins in the human genome, which makes correct identification of miRNA precursor candidates a daunting task
[8]. Novel features and rules thus became necessary for better identification of miRNAs. One early approach was designed for
C. elegans
[9], based on the degree of conservation of miRNAs across species. A similar approach was adopted by miRSeeker
[10], designed for
Drosophila. While searching for homology, miRSeeker considered sequence conservation along with criteria such as base pairing and the presence of miRNAs in at least one arm of the hairpin. Although these tools were important milestones, problems with their accuracy and consistency persisted, leading to the development of better approaches. Later, Bentwich
et al. developed PalGrade
[8], which assigned a stability score to every hairpin based on its secondary structure. It also implemented a scoring scheme based on features such as hairpin length, loop length, sequence repetitiveness, bulge length and type of inverted repeat. Berezikov’s group analyzed genomic regions with conserved profiles, employing phylogenetic shadowing, and selected the sequences able to form hairpins
[11]. Sætrom
et al. identified miRNA-specific properties, such as structural conservation in miRNA primary transcripts, which might lead to the development of better-performing precursor identification tools
[12].
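As an illustration of the hairpin descriptors mentioned above (hairpin length, loop length, unpaired stem bases), the following minimal Python sketch derives them from a dot-bracket secondary structure string. It is not PalGrade's scoring scheme or the code of any published tool: the function name `hairpin_features`, the returned keys and the toy sequence are assumptions made for illustration, and the structure is assumed to have been predicted beforehand by an RNA folding program.

```python
def hairpin_features(seq, structure):
    """Simple descriptors of a single-hairpin fold given in dot-bracket form.

    Hypothetical helper for illustration; not taken from any published tool.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    if structure.count("(") != structure.count(")"):
        raise ValueError("unbalanced dot-bracket string")

    first_open = structure.find("(")
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    last_close = structure.rfind(")")
    # A single hairpin has every "(" before every ")".
    if last_open == -1 or last_open > first_close:
        raise ValueError("structure is not a single hairpin")

    # Terminal loop: unpaired bases between the innermost base pair.
    loop_length = first_close - last_open - 1
    # Unpaired bases inside the stem (bulges and internal loops).
    inner = structure[first_open:last_close + 1]
    stem_unpaired = inner.count(".") - loop_length
    gc = sum(base in "GC" for base in seq.upper())

    return {
        "length": len(seq),
        "base_pairs": structure.count("("),
        "loop_length": loop_length,
        "stem_unpaired": stem_unpaired,
        "gc_content": gc / len(seq),
    }


# Toy input, invented for this example.
features = hairpin_features("GCAUAAAAAUGC", "((((....))))")
print(features)
```

A rule-based filter of the kind described in the literature would then apply thresholds to such descriptors; any thresholds would have to be chosen from real data and are deliberately not suggested here.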
Most of the initial approaches for miRNA candidate identification relied on filter-based protocols. These combined rules derived for stem size, loop size, number and size of bulges, GC content, etc. However, such approaches may not be appropriate, particularly when instances deviate from the conservation rule. Multivariate statistical approaches have been observed to perform better than rule-based methods. One pioneering approach of this kind was Triplet-SVM
[13]. There, the authors stressed that the hairpin structure occurs not only in miRNAs but also in several other genomic elements. They therefore included pseudo-hairpins in the negative dataset to build a better model. The same group also identified a property named the triplet element, which captured structural as well as sequence information and was used to train support vector models. This resulted in a remarkable increase in accuracy and performance consistency. Subsequently, there was a surge in the use of different machine learning approaches, including Random Forests
[14], Bayesian methods
[15] and many other SVM-based tools, in which the triplet element or its variants gained importance. Agarwal
et al.
[16] developed a method to discover miRNA precursors by applying context-sensitive HMMs to model RNA secondary structures. The authors used memory-supported probabilistic models to construct paired regions as well as symmetrical bulges in miRNAs. Ritchie
et al. developed MirEval
[17], which combined windowed structural scanning using Triplet-SVM’s methodology
[13] and a protocol to evaluate structural properties. It also implemented phylogenetic conservation through the GERP method
[18] and a sequence homology search method introduced by Tanzer and Stadler
[19]. Successful identification of miRNA precursors, using Drosha processing site information along with regular sequence and structural features in an SVM, was demonstrated by Helvik
et al.
[20]. A recently developed tool, MiRPara
[21], took a more realistic approach to its datasets. The authors proposed that the structures and sequences reported in miRBase
[22] might carry incomplete information for miRNAs, since the actual precursors could have longer sequences and might therefore have different structural and compositional features. The authors identified a few region-specific sequence and structural features for partial pri-miRNA sequences, which performed well for a large number of species.
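The triplet element idea discussed above can be illustrated with a short sketch. As commonly described for Triplet-SVM, each element combines the nucleotide at a position with the paired or unpaired state of that position and its two neighbours, giving 4 × 8 = 32 possible elements. The code below is a simplified reading of that idea and not the authors' implementation: the function name `triplet_frequencies`, the decision to count over the whole sequence, and the normalization by the number of counted windows are assumptions.

```python
from itertools import product


def triplet_frequencies(seq, structure):
    """Frequency of each of the 32 (middle nucleotide + 3-state structure) elements.

    Simplified illustration; counting region and normalization are assumptions.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    seq = seq.upper().replace("T", "U")
    # Opening and closing brackets both mean "paired", so they are merged.
    paired = structure.replace(")", "(")

    keys = [nt + "".join(states)
            for nt in "ACGU"
            for states in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)

    total = 0
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases such as N
            counts[key] += 1
            total += 1

    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin, invented for this example.
vector = triplet_frequencies("GCAUAAAAAUGC", "((((....))))")
print(len(vector), vector["A..."])
```

The resulting 32-dimensional vector is the kind of input on which a classifier such as an SVM could be trained, with real precursors as positives and pseudo-hairpins as negatives.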
As mentioned above, the initial approaches for miRNA discovery relied largely on conservation of sequences across species, homology, hairpin detection and free energy calculation. This localized miRNA model building and led to detection of similar kinds of miRNA candidates. Therefore, even with newer approaches, the influence of homology would suppress the identification of novel candidates and other unseen properties of miRNAs, which in turn limited the expansion of datasets. Recent advances in Next Generation Sequencing (NGS) technologies have helped guide miRNA discovery by providing a confidence measure through read mapping to reference sequences. This has also encouraged genome-wide scanning for miRNA candidates with better speed and confidence, reporting novel miRNA candidate regions that were missed by earlier techniques and tools
[23]. Owing to such developments, an approximately exponential increase in the number of novel miRNA families is notable in recent releases of miRBase. Building on the breakthroughs made by NGS, some groups have recently developed tools for detecting miRNA candidates from NGS read data. miRDeep
[24] is one such tool; it analyzes data from the Illumina Genome Analyzer sequencing platform and identifies miRNA candidates by considering the distribution of reads across a reference. The mapped regions are then assessed for RNA secondary structure based information. Following miRDeep, a few more such tools like miRNAkey
[25], miRanalyzer
[26] and MIReNA
[27] have emerged.
The present work reports a novel approach to identify miRNA candidates with high accuracy and stable performance over a wide range of species. Biologically relevant novel features, such as miRNA-specific structural profile matrices guided by the mature miRNA and position-wise variation profiles of structural triplet density, have been introduced to achieve superior and stable performance. An ensemble machine learning methodology, Bootstrap Aggregating (BAGging), has been implemented. It employs complementary classifiers such as Support Vector Machine (SVM), Naive Bayes (NB) and Best First Decision Trees (BFTree) to build the final classifier models for a large number of species, which strongly enhances performance. An NGS module has been built to find miRNA precursor candidates using Illumina read data. miRNA candidate detection requires scanning large volumes of sequence data, which makes it computationally demanding. Considering this, the entire approach has been implemented as a web server as well as a user-friendly standalone GUI version, both with a parallel architecture.
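The general principle of Bootstrap Aggregating can be sketched in a few lines of Python: each base model is trained on a sample drawn with replacement from the training set, and predictions are combined by majority vote. The sketch below is not the implementation reported in this work. It substitutes a toy nearest-centroid base learner for the SVM, NB and BFTree classifiers named above, and all function names, parameter values and data points are assumptions made for illustration.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Toy base learner (assumption): store the mean feature vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict_centroid(model, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda lab: dist2(model[lab]))


def bagging_fit(X, y, n_models=25, seed=0):
    """Train n_models base learners, each on a bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(train_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Invented two-feature data: label 1 = "precursor-like", 0 = "background".
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.2],
     [0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = bagging_fit(X, y)
print(bagging_predict(ensemble, [0.85, 0.75]))
```

Because every base learner sees a slightly different resampling of the data, the vote tends to be less sensitive to individual training instances than any single model, which is the usual motivation for bagging.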