PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Nat Methods. Author manuscript; available in PMC 2014 January 1.
Published in final edited form as:
PMCID: PMC3818907
NIHMSID: NIHMS475970

MicroRNA Target Site Identification by Integrating Sequence and Binding Information

Abstract

High-throughput sequencing has opened numerous possibilities for the identification of regulatory RNA-binding events. Cross-linking and immunoprecipitation of Argonaute protein members can pinpoint microRNA target sites within tens of bases, but leaves the identity of the microRNA unresolved. A flexible computational framework that integrates sequence with cross-linking features reliably identifies the microRNA family involved in each binding event, considerably outperforms sequence-only approaches, and quantifies the prevalence of noncanonical binding modes.

The role of microRNAs (miRNAs) in gene regulatory pathways has been well demonstrated, including numerous examples in which miRNA dysregulation has been implicated in disease1. As part of the RNA induced silencing complex (RISC), miRNAs guide the complex to repress target messenger RNAs. Immunoprecipitation of Argonaute (AGO) protein family members followed by global profiling of bound RNA has provided an experimental high-throughput approach to map miRNA targets genome-wide2. In particular, cross-linking and immunoprecipitation (CLIP) of AGO proteins provides specific regions of likely AGO binding and miRNA target sites3,4.

The resolution of CLIP experiments has been generally insufficient to unambiguously identify the acting miRNA via sequence matches4. Recent protocols, including iCLIP and PAR-CLIP, increased resolution by providing additional signals diagnostic of cross-linked locations5. In particular, PAR-CLIP obtains high cross-linking efficiency due to the presence of the nucleoside analog 4-thiouridine, which leads to characteristic thymine-to-cytosine transitions in the library reads in the vicinity of protein-mRNA cross-linking4. To define a narrow region (typically ~30nt) most likely bound by the protein, our previously published PARalyzer software6 quantifies this T-to-C transition profile relative to background read density via a kernel density estimator (Fig. 1a).

Figure 1
Identifying miRNA target sites by a joint sequence and interaction model. (a) Example of an AGO-interaction site identified by PARalyzer. Shown are the kernel density estimate of signal versus background T-to-C transitions, and the read depth of a 3′ ...

MiRNA target sites are characterized by one of several types of seed sequence matches unique to each miRNA family7 (canonically, a perfect 7nt complement to positions 2–8 of the miRNA). In AGO-immunoprecipitated libraries, the cross-linking site indicates a radius for RISC-mediated miRNA heteroduplex formation on the mRNA substrate, and restricts the search space for functional seed matches. Furthermore, the PAR-CLIP transition signal is preferentially located directly 5′ of the miRNA seed match on the mRNA4,6,8, likely due to the biophysical properties of the heteroduplex within functional miRNP complexes7 (Fig. 1b). mRNA conservation levels immediately 3′ of peaks indicate that many of these seed matches are selectively maintained (Fig. 1c).

Despite continuous improvements in de novo computational predictions of target sites, efforts have been fundamentally limited by the brevity of sequence motifs describing functional miRNA-target relationships. The quantitative information in AGO PAR-CLIP datasets motivated us to develop an integrative computational method to identify miRNA target sites at the nucleotide level with high accuracy. It combines the T-to-C transition signal with miRNA seed-pairing and its evolutionary conservation, as well as the spatial relation between binding and sequence features. We implemented the model within a general-purpose hidden Markov model (HMM) framework9 called MUMMIE (Multivariate Markov Modeling Inference Engine). We model the shape of PAR-CLIP signals with a 6-state topology (Fig. 1d), in which state 5 expands into a 41-state submodel for detection of seed matches of various types (Fig. 1e). The model is parameterized to preferentially predict seed matches that are long, are located closely 3′ to a PAR-CLIP peak, and are highly conserved. Tradeoffs between these features are naturally accounted for during prediction so that suboptimal sites can still be detected, albeit with lower scores (posterior probabilities; Supplementary Fig. 1).

Most current target predictors identify candidate sites within whole transcripts, usually parameterized on the indirect cumulative effect on mRNA or protein steady-state levels. The availability of in vivo direct binding data changes the problem of target prediction into one of assigning the most likely miRNA(s) to each of the observed AGO binding locations. We applied suitable site-level target predictors to cross-linked regions from multiple datasets, and evaluated the improvement obtained by the explicit joint model in the new target predictor. Analyses were limited to regions covered by aligned PAR-CLIP reads (termed groups); this dramatically increases the accuracy of methods compared to their application without the benefit of CLIP information (Supplementary Fig. 2). Because the identity of the miRNA in bound AGO complexes is unknown, we assessed the relative prediction accuracy via signal-to-noise ratio (SNR) based on randomized shuffling of miRNA sequences (i.e., enrichment of complementary miRNA seeds over decoy sequences among the predictions; see Online Methods).

Specifically, we compared MUMMIE to the following alternatives: The latest release of the state-of-the-art de novo miRNA target site predictor, TargetScan10, utilizes conservation evidence as well as various sequence context scores, but makes no explicit use of CLIP information. MIRZA11 implements a new biophysical model of target recognition, based on context-specific RNA duplex formation estimated on high-quality PAR-CLIP data. We contrasted these model-based approaches to two baseline strategies for identifying miRNA target sites from PAR-CLIP data: choosing the nearest seed match to each PARalyzer peak, or predicting all seed matches within each PARalyzer cluster (the maximal interval in which T-to-C transition rate exceeds background expectation6). All approaches used the 100 most highly-expressed miRNAs in the cell line, which delivered the best trade-off between SNR and sensitivity (Supplementary Fig. 3).

On a PAR-CLIP dataset from a lymphoblastoid cell line infected with Epstein-Barr virus (LCL-BAC)12, our model achieves higher sensitivity, specificity, and SNR than other approaches (Fig. 1f, g). Thus, the joint model has a greater ability to discriminate real from decoy target sites (higher SNR), and it does so by predicting fewer randomized decoy sites within those CLIP groups (higher specificity). TargetScan limits predictions to the three “canonical” seed matches (7mer-A1, 7mer-m8, 8mer-A1), while our model predicts seven types (6mer3–8, 6mer2–7, 7mer-m8, 7mer-m1, 7mer-A1, 8mer-A1, 8mer-m1; see also Supplementary Fig. 4). MIRZA’s energy-based model scores the whole duplex and is therefore agnostic to specific seed matches. While it reaches a higher sensitivity than MUMMIE, by far the largest contribution of “noncanonical” targets appear to involve the 6mer seeds included in MUMMIE. The baseline models exhibited competitive sensitivity, but at very low SNR/specificity levels. Results on additional PAR-CLIP data sets prepared by MNase digestion13, in place of the double RNase T1 digestion used in the original protocol, confirmed all of the observed performance trends (Supplementary Fig. 5; Supplementary Table 1).

To make results as comparable as possible, we used TargetScan’s branch length score as conservation evidence10, which is unavailable for 6mer seed matches. This effectively biases our model against predicting 6mer sites when parameterized to weight conservation highly, and helps improve specificity and SNR at the low-sensitivity end of the spectrum (Supplementary Fig. 2). Increasing the number of miRNAs represented in our model beyond the 100 most highly expressed ones progressively degrades performance (Supplementary Fig. 3), illustrating the complementary benefit of miRNA expression measurements in addition to RISC binding information. Examples of several previously or newly experimentally validated MUMMIE predictions12 involving different seed types are given in Supplementary Figs. 6 and 7.

To directly address the question of whether predicted targets were due to the presence of the specified miRNA, we contrasted the LCL-BAC predictions with those from three additional Epstein-Barr infected cell lines, each of which has one viral miRNA deleted (BHRF1-1, BHRF1-2, and BHRF1-3, respectively). The PAR-CLIP signal in the deletion lines was nearly absent at the locations of predictions involving a BHRF miRNA in the original cell line (Fig. 2, Supplementary Fig. 8). For two miRNAs, the loss of sites was accompanied by consistent upregulation in steady-state target mRNA levels as measured by Illumina sequencing. MUMMIE-based observations agreed with MIRZA-predicted targets (Supplementary Fig. 9).

Figure 2
Validation of predicted sites and their impact on expression levels. We computed the aggregate PAR-CLIP signals at predicted target sites for specific miRNAs expressed in the LCL-BAC cell line, and compared it to the aggregate signal at the same sites ...

As shown in previous studies, the presence of a perfect seed match cannot explain all CLIP groups4,6,14. In the LCL-BAC dataset, the sensitivity of MUMMIE does not exceed 80% at the most lenient settings; a naive scan for 6-mer seed matches to the top 100 expressed miRNAs does not exceed 84% (Supplementary Fig. 10). This motivated us to investigate the extent to which binding with an imperfect seed match might explain remaining CLIP groups. We focused on a recently described class of imperfect targets, namely “bulge” sites, which have been suggested to explain as many as 25% of all the HITS-CLIP clusters for miR-12414. We adapted our model to sites in which one of the mRNA residues pairs with the miRNA only during an initial annealing step but becomes unpaired in the final duplex, forming a bulge in the mRNA. In addition to the previously proposed pivot pairing between mRNA positions 5–6, we investigated potential bulges between positions 4–5 and positions 3–4. Bulge site predictions showed distinctly lower SNR profiles, and when combining all three pivot locations, predictions covered only ~15–20% of orphan groups (~1–2% of total groups; Supplementary Table 2, Supplementary Figs. 11, 12, and 13). Even MIRZA, which does not make any seed type assumptions, assigns miRNAs to only a small fraction of additional clusters.

Our findings illustrate the superiority of transcriptomic assays over de novo sequence analysis for identifying functional sites active in a specific condition, as well as of principled modeling techniques over ad hoc strategies. By jointly modeling multiple relevant variables — including evolutionary conservation, residue transition rates, spatial positioning, and sequence composition — MUMMIE provides for finer discrimination between true and decoy sites. There is room for further improvement, via incorporation of additional evidence including structural accessibility, and relative or absolute production or steady-state levels of miRNAs or mRNAs. The framework provides flexibility in the ability not only to incorporate additional covariates, but also to choose the desired compromise between sensitivity and specificity, and to define additional classes of seed matches (e.g., centered sites15 or compensatory 3′ binding16). However, in our comparative evaluation, genuinely imperfect seed sites appeared to account for a small fraction of total targets only. The remaining 20% of AGO clusters may reflect low-affinity transitory binding, experimental noise, or AGO targets that do not involve miRNAs17.

Due to the exploitation of complementary information sources, we have arrived at a performance level where the question of miRNA targeting is no longer a binary one, but one of quantitative impact of a miRNA on a transcript (cf. Supplementary Fig. 6; this idea has specifically been pursued in the concurrently developed MIRZA). To this aim, however, multiple genomic data sets have to be generated; recent sequence-only target predictors remain more generally applicable18, albeit at the cost of significant reductions in accuracy. Yet, distinct transition patterns are also observed for other post-transcriptional regulators such as Quaking and Pumilio4, suggesting the utility of such models for RNA-binding protein target identification in general.

Online Methods

PAR-CLIP Data Generation

Multiple Argonaute 2 PAR-CLIP libraries from two different human cell line sources (LCLs and HEK293) were used in this study. Established lymphoblastoid cell lines (LCL-BAC, LCL-BAC-D1, LCL-BAC-D2, LCL-BAC-D3) from the same donor were infected with variations of EBV B95-8 Bacmid. LCL-BAC-D1, LCL-BAC-D2, and LCL-BAC-D3 (the “deletion lines”) were infected with EBV miRNA-deficient viruses lacking miR-BHRF1-1, miR-BHRF1-2, and miR-BHRF1-3 expression, respectively20. All LCLs were maintained in RPMI 1640 medium supplemented with 15% FBS and 10 μg/mL gentamicin (GIBCO) and analyzed three to five months post-infection. PAR-CLIP data for LCL-BAC, LCL-BAC-D1 and LCL-BAC-D3 was previously published by us12 (GEO GSE41437); PAR-CLIP data for LCL-BAC-D2 was newly generated following the same protocol, which involved treatment with double RNase T1 digestion.

LCL-BAC libraries were prepared using the original PAR-CLIP protocol4, which includes double RNase T1 digestion steps to fragment cross-linked RNA prior to sequencing. This has raised concerns that identified binding events may exhibit biases due to sequence preferences of the digestion enzymes13,14. We therefore additionally analyzed previously published PAR-CLIP data from HEK 293 cells, two biological replicates involving MNase digestion of isolated RNA in place of double T1 digestion (“MNase A”, GEO GSM714646; “MNase B”, GSM714647)11. See Supplementary Table 1 for library and annotation metrics for all PAR-CLIP libraries.

For miRNA target prediction we considered only genomic intervals annotated as 3′ untranslated regions (UTR) in Ensembl v. 5821; for genes with multiple annotated transcripts (isoforms), we kept the one with the longest 3′ UTR (measured in bases of mature transcript). For the LCL-BAC data set, the miRNA set used for prediction is based on small RNA deep-sequencing data of the same culture12. For the external MNase data sets, we used the miRNA set of the same cell line that was reported in a previous study4.

PAR-CLIP Data Processing and Significance Testing

PAR-CLIP data for all cell lines were processed by PARalyzer6 to produce sequences of continuous values. PARalyzer smoothes the sequence of discrete read count values via kernel density estimation, and the resulting continuous values represent relative frequencies of T-to-C transitions. This sequence is referred to as the “signal track” for use in MUMMIE. In practice, the PARalyzer signal track for a complete UTR will typically be sparse; i.e., zero everywhere except at and around an occupied binding site. These nonzero regions, which correspond to scaffolds of aligned PAR-CLIP reads, are termed “groups.” When binding sites are close together, their scaffolds may coalesce, resulting in one group for multiple occupied sites. PARalyzer therefore defines individual “interaction sites” by identifying maximal intervals in which the T-to-C transition signal is higher than the background; we term these intervals “clusters.” Clusters therefore occur within groups, and generally provide for a higher resolution of RBP target sites. However, their utility in pinpointing miRNA target sites in AGO libraries is complicated due to lack of T-to-C transition signal within the miRNA-mRNA duplex site: the signal “peak” generally does not overlap the actual miRNA seed match, and the PARalyzer software therefore provides an option to “pad” and increase the size of AGO clusters. Here, other than for the baseline seed match comparison, we do not explicitly use PARalyzer clusters; instead, we run MUMMIE on the complete UTR, after normalization of the signal for individual groups. Normalization of groups ensures that the maximum signal value for each group is 1.0, and helps to control for differing binding affinities, sequencing biases, and miRNA/mRNA abundance.

To validate library-specific predictions, we compared the PAR-CLIP signal at predicted “true” sites, for a miRNA expressed in LCL-BAC, against the signal in one of the LCL-BAC-D1/2/3 libraries not expressing the miRNA of interest (Fig. 2a). The significance of the signal difference was assessed by a non-parametric test, in which we first ranked target sites of all top 100 expressed miRNAs by their difference in signal and quantified the enrichment of true targets at the top, i.e. among the most-changed sites. This implicitly accounts for the overall difference in signal observed across all targets, due to incomplete saturation of the deep sequencing libraries (Fig. 2b). Specifically, we started from the smoothed T-to-C transition frequencies as computed by PARalyzer, i.e. the signal track including normalization as described above. To represent a binding signal, we summed over frequencies in a local area (±3 nucleotides) around a consistent reference location, in this case the mRNA coordinate across from miRNA position 8, i.e. the beginning of the longest possible seed match. As test statistic, we used the fraction of the binding signal detected in the deletion line compared to the signal in the wildtype. Significance was assessed by the Wilcoxon rank-sum test.

Quantification of mRNA Level Changes

Paired-end strand-specific RNA-seq was performed for all four established LCLs. Total RNA was collected with TRIzol (Life Technologies) and libraries were prepared using ScriptSeq v2 kit and Ribozero (Epicentre). Biological replicates were performed for LCL-BAC-D1. RNA-seq data was aligned and quantified using RSEM with reference genome and transcript annotation23. Differential analysis was performed using EBseq within RSEM24. Real fold change (RealFC value from EBseq) for genes with median log2(TPM + 1) > 5 across all samples were utilized for evaluating differential expression of MUMMIE or MIRZA predictions; the same sets of genes were used to assess loss of binding events in the deletion libraries. See Supplementary Table 3 for specific analysis parameters and quality assessments.

To assess the enrichment of predicted targets of a specific miRNA among the top differentially regulated gene sets, genes were ranked by mRNA expression fold change (based on the data for LCL-BAC vs the corresponding deletion line) and grouped into sets of increasing size. For each top differentially regulated set, an enrichment of predicted targets was computed as a ratio of a fraction of predicted targets in the top set and a fraction of total number of predictions in the full gene set. Target sets of miRNAs with a similar number of MUMMIE predictions (between roughly half to twice as many sites) were used to quantify background expression changes of presumably unaffected targets.

Luciferase assay

293T cells were maintained in Dulbecco’s modified eagle’s medium (DMEM) supplemented with 10% fetal bovine serum (FBS) and antibiotics. miRNA expression plasmids contain ~200 nucleotides of the primary miRNA cloned from genomic DNA into pLCE at XhoI/XbaI. Expression of the miRNA was confirmed using indicator assays as previously described12. Luciferase reporter assays were carried out as previously described12. Briefly, 293T cells were co-transfected in 24-well plates with 1 μg miRNA expression vector (pLCE-miRNA), 12 ng pL-CMV-GL3-3′UTR, and 12 ng pL-CMV-RLUC (renilla internal control) using Fugene 6 (Promega) according to manufacturer’s instructions. Lysates were harvested 48–72 hrs post-transfection and assayed for luciferase activity using the dual luciferase reporter assay kit (Promega).

Modeling Software

Binding site models were implemented within the open-source software MUMMIE (MUltivariate Markov Modeling Inference Engine), available online at http://www.genome.duke.edu/labs/ohler/research/MUMMIE. This system permits user-specified model topologies to be designed for a wide array of potential bioinformatic applications. The software includes command-line routines for both parameter estimation (“training”) and decoding (“prediction”). The following sections describe the models and algorithms used to obtain predictions for this study.

Models

Prediction of target sites from binding and sequence data is achieved via hybrid continuous-discrete multivariate hidden Markov models. A Markov model is a probabilistic generative model for sequential data with a fixed set of discrete states Q={q0, q1, …, qN−1}. We here consider discrete-time models, in which observations are available for T={0,1,…,L−1} positions in a biological sequence S=s0s1sL−1, which the model is assumed to have emitted. Emitted sequence elements may take continuous values; for multivariate models, each sequence element is a vector of continuous values (each emitted by a separate “emission channel”). In hybrid continuous/discrete models, each emission channel is either discrete or continuous. Thus, for each vector in the emission sequence, each component of that vector will be a value that is either a symbol drawn from a discrete alphabet, or a numerical value. For example, an RNA emission channel would accept symbols from the set {A,C,G,U}, an “aligned read-count” channel would accept nonnegative integers denoting numbers of reads aligned at a given genomic position, and a “probability of being unpaired” track would accept real values between 0 and 1.

The model M emits a biological sequence S as follows. Beginning in a special silent state q0, at each time step t the model chooses a state to transition into, by drawing a new state xt+1[set membership]Q from a transition probability distribution, P(x|xt), which is conditional on the current state, xt. Upon entering the new state, it draws a random vector y[set membership]V from the emission distribution P(y|xt+1), where V is the set of all possible emission vectors for the model. After emitting the vector’s components into the respective emission channels, the model chooses the next transition according to P(x|xt). If state q0 is chosen, the machine terminates and S is taken to be the complete emission; otherwise, it continues its run by alternating between transitions and emissions, until state q0 is eventually chosen. Note that q0 never makes emissions.

Because the emission distribution is conditional only on the current state, each emission is conditionally independent of all other emissions, given the associated state. This is the case of 0th-order emissions. A model with higher-order emissions may condition each emission on events (transitions or emissions) progressively further into the past. For the present work we consider only higher-order discrete emissions, in which the emission of a symbol may be conditioned on some number of immediately preceding symbols emitted into the same channel; this permits us to model seed matches for many different miRNAs while using a relatively small number of states (see below).

For miRNA target prediction, we designed a 47-state Markov model. The model is a hybrid continuous/discrete multivariate hidden Markov model with three emission channels. The first channel is the PARalyzer signal representing the relative abundance of T-to-C transitions in aligned CLIP scaffolds. This channel is continuous and modeled via a four-component Gaussian mixture. The second channel, branch length score (BLS), provides a continuous measure of conservation; it was computed using a script from the TargetScan package. The third channel comprises the genomic DNA sequence corresponding to the mRNA, and is modeled using a 7th-order Markov chain. Channels are pre-aligned so that for each position in a UTR we have one nucleotide and two continuous values.

The model topology is depicted schematically in Fig. 1d. State 1 models the background regions between PAR-CLIP groups. State 3 models the peak of the PAR-CLIP group, while states 2, 4, 5, and 6 model the flanking regions around the peak. State 5 is a metastate25 which expands into the 41-state submodel shown in Fig. 1e; this metastate models the actual target site of the miRNA seed. Our default model detects seven types of miRNA binding: 6mer3–8, 6mer2–7, 7mer-m8, 7mer-m1, 7mer-A1, 8mer-A1, and 8mer-m1.

The model in Fig. 1d preferentially visits state 5 (the miRNA binding site) a short distance 3′ of a PAR-CLIP peak (which is generally matched by state 3); this preference reflects the strong enrichment of miRNA seed matches 3′ of peaks (Fig. 1b). In consequence, model predictions also show a strong preference for this location relative to the peak (Supplementary Fig. 1).

Fig. 1b suggests a small fraction of sites 5′ of the peak, and the model predicts a fraction of these 5′ sites. This illustrates the flexibility of this class of models: when the most promising site occurs 5′ of the peak, the decoding algorithm will permit state 3 to be visited earlier in the sequence so that state 5 can match the putative target site. Likewise, when no promising seed match is present, the decoding algorithm may choose to not predict any binding site for the PAR-CLIP group. This flexibility is a function of the model parameters, in particular the variance of the emission distribution for the PAR-CLIP signal: for low variances (~0.01) the model predicts many sites (typically one per PAR-CLIP group), while for high variances (~1.0) it predicts fewer sites. The different points in Fig. 1f, g correspond to different settings of the PARalyzer signal variance, with a high variance parameter corresponding to high specificity and high SNR, and a low variance parameter corresponding to high sensitivity (the following variances were used for these plots: 1.5, 1.25, 0.75, 0.5, 0.375, 0.25, 0.20, 0.15, 0.1, 0.075, 0.05, 0.01, 0.005).

To model bulge sites with bulges at specific positions complementary to the seed, we use the same metamodel structure shown in Fig. 1d, but simplify the submodel for state 5 to include only a single linear chain of seven states to match the six seed residues plus the bulge position embedded within the seed.

Inference Algorithms

Prediction of sites can be carried out in MUMMIE using either Viterbi decoding or posterior decoding, or a combination of the two.

Viterbi decoding identifies the single most probable path through the HMM states for emitting the multivariate sequence. Any occurrence of metastate 5 in that most probable path is interpreted as a predicted miRNA binding site; the exact sequence of states within the metastate dictates the type of binding (e.g., 6mer2–7, 7mer-m1, etc.), and the matching nucleotides in the emitted sequence identify the miRNA family (seed sequences may or may not be unique to a given miRNA within a family; when multiple miRNAs share a predicted seed, all are listed as the potential binding miRNA).

In contrast to Viterbi decoding, posterior decoding utilizes the well-known forward and backward algorithms25 to compute the posterior probability P(q,k) that state q was active at position k when the sequence was generated by the HMM. In addition to providing this standard functionality, MUMMIE also computes for each 8bp interval the posterior probability that a miRNA binds within that 8bp interval, via any of the modeled binding modes (i.e., by summing the posteriors for each binding mode for that miRNA within that 8bp interval).

Finally, MUMMIE supports a combination of posterior and Viterbi decoding via the posterior Viterbi algorithm26, in which emission and transition probabilities in the Viterbi algorithm are replaced with posterior probabilities. This decoding algorithm was used for the results presented here.

Model Training

Considering all states modeling the actual miRNA binding site to be foreground states and all others to be background states, training of the background states was accomplished by applying expectation maximization (EM) to entire UTRs, with the following exceptions. For the peak state (state 3), the PARalyzer signal mean was set to 1.0 and its variance was arbitrarily set to 0.01; for the flanking states (states 2, 4, and 6) the signal mean was set to 0.5 and the variance to 0.01; these mean values were chosen to have these states match the respective regions of an idealized PARalyzer signal profile at a binding site. The mean and variance of the conservation track were set to 0 and 0.01, respectively, for all background states.

For the foreground states, the nucleotide emission distribution was set so as to permit only the training miRNA seeds to be emitted; PARalyzer signal mean was set to 0.5 and variance to 0.01 as in the background flank states; conservation mean was set to 3.0. Nucleotide sequence emissions for flanking states were trained via EM on sequences within PARalyzer groups, and for the peak state on the collection of single nucleotides at each PARalyzer peak (Supplementary Table 4). Variance of the conservation track was trained on a held-out set of training data by observing the resulting sensitivity-vs.-SNR curves at different settings and selecting the value that produced the curve with the highest overall SNR. All covariances were set to 0. The Markov order of nucleotide emissions varied from zero to seven along the length of a seed, so as to enforce that only valid miRNA seeds may be emitted, as per the training set.

Bulge Site Analyses

We investigated different locations for the pivot nucleotide in bulge site pairing. A bulge seed match is a perfect seed match to miRNA position 2–7 with a bulge nucleotide insertion that can pair with the pivot nucleotide. Similar to the canonical perfect seed match prediction, we computed SNR of bulge site predictions. Here, we excluded the bulge seed matches that share a 6mer with any human or EBV miRNA.

Additionally, for bulge type 2 (position 4–5) evaluation and prediction, we excluded bulge seed matches to original miRNAs with the same nucleotide at position 5 and 6, as its pairing in the in the nucleation step could extend one more position and thus provide for the same bulge seed match as in type 1 (position 5–6). Similarly, seeds for bulge type 3 (position 3–4) did not include bulge seeds that extended to bulge type 2.

Prediction Accuracy Evaluation and Comparison with Other Predictors

We computed signal-to-noise (SNR) ratios by comparing numbers of non-shuffled and shuffled sites among a set of predictions. Shuffled miRNAs were created via a random process that preserves the dinucleotide frequencies observed in real miRNAs, as described previously27. These were then screened to ensure that their expected seed-match frequencies in PAR-CLIP groups did not differ by more than 15% from the same statistic for the real (unshuffled) miRNAs. The signal-to-noise ratio was averaged over 1,000 runs in which we sampled one shuffled miRNA for each original (non-shuffled) miRNA; state 5 of the model was then trained on the resulting set of shuffled and non-shuffled seeds, so that a single decoding run of the model could predict any of the seeds in the training set.

We used TargetScan release 6.110 to compute TargetScan predictions, “context+” scores, and branch length scores. A 23-way alignment of 3′ UTRs was used, and miRNAs were evaluated individually. We computed signal-to-noise ratio at varying context+ score thresholds for the TargetScan sensitivity-SNR plots.

MIRZA11 was executed in its default “noupdate” mode, and optionally provided with cell-type specific expression levels of miRNAs. Mirroring the example of its developers, inputs to MIRZA were 51nt sequences centered on the peak T-to-C conversion site as identified by PARalyzer, and 21nt mature miRNA sequences. MIRZA’s “target frequency” score was used to rank predictions when computing sensitivity, specificity, and SNR.

Supplementary Material

Acknowledgments

This work was supported by grants from the National Science Foundation (MCB-0822033) and the National Institutes of Health (R01-GM104962) to U.O. and by awards from the National Institutes of Health (R01-AI067968; R01-DA030086) to B.R.C. Contributions by R.L.S. were additionally supported by a Duke Center for AIDS Research (CFAR) small grant award (P30-AI064518). We thank S. Grosswendt and M. Piechotta for critical reading of a manuscript draft.

Footnotes

AUTHOR CONTRIBUTIONS

W.H.M implemented the modeling framework and designed the models, P.L. and W.H.M. performed the computational experiments, P.L. investigated bulge sites, N.M. and D.L.C. configured and ran the PARalyzer pipeline and provided guidance on analyzing its outputs. P.L., W.H.M., and U.O. analyzed the data. R.L.S., B.R.C. and N.M. designed and performed wet lab experiments, and W.H.M., P.L., and U.O. wrote the manuscript.

References

1. Jiang Q, et al. Nucleic Acids Res. 2009;37:D98–104. [PMC free article] [PubMed]
2. Hendrickson DG, et al. PLoS ONE. 2008;3:e2126. [PMC free article] [PubMed]
3. Chi SW, Zang JB, Mele A, Darnell RB. Nature. 2009;460:479–86. [PMC free article] [PubMed]
4. Hafner M, et al. Cell. 2010;141:129–41. [PMC free article] [PubMed]
5. Sugimoto Y, et al. Genome Biol. 2012;13:R67. [PMC free article] [PubMed]
6. Corcoran DL, et al. Genome Biol. 2011;12:8. [PMC free article] [PubMed]
7. Bartel DP. Cell. 2009;136:215–233. [PMC free article] [PubMed]
8. Baltz AG, et al. Mol Cell. 2012;46:674–90. [PubMed]
9. Juang BH, Rabiner LR. Technometrics. 1991;33:251–272.
10. Friedman RC, Farh KK-H, Burge CB, Bartel DP. Genome Res. 2009;19:92–105. [PubMed]
11. Korshid M, Hausser J, Zavolan M, van Nimwegen E. Nat Methods. 2013;10:253–255. [PubMed]
12. Skalsky RL, et al. PLOS Pathog. 2012;8:e1002484. [PMC free article] [PubMed]
13. Kishore S, et al. Nat Methods. 2011;8:559–64. [PubMed]
14. Chi SW, Hannon GJ, Darnell RB. Nat Struct Mol Biol. 2012;19:321–7. [PMC free article] [PubMed]
15. Shin C, et al. Mol Cell. 2010;38:789–802. [PMC free article] [PubMed]
16. Brennecke J, Stark A, Russell RB, Cohen SM. PLoS Biol. 2005;3:e85. [PubMed]
17. Leung AK, et al. Nat Struct Mol Biol. 2011;18:237–44. [PMC free article] [PubMed]
18. Betel D, et al. Genome Biol. 2010;11:R90. [PMC free article] [PubMed]
19. Siepel A, et al. Genome Res. 2005;15:1034–50. [PubMed]
20. Feederle R, et al. J Virology. 2011;85:9801–10. [PMC free article] [PubMed]
21. Flicek P, et al. Nucleic Acids Res. 2009;38(Suppl 1):D557–D562. [PMC free article] [PubMed]
22. Subramanian A, et al. PNAS. 2005;102:15545–15550. [PubMed]
23. Li B, Dewey C. BMC Bioinformatics. 2011;12:323. [PMC free article] [PubMed]
24. Leng N, et al. Bioinformatics. 2013;29 doi: 10.1093/bioinformatics/btt087. Advance Access ePub. [PMC free article] [PubMed] [Cross Ref]
25. Majoros WH. Methods for Computational Gene Prediction. Cambridge University Press; Cambridge, UK: 2007.
26. Fariselli P, Martelli PL, Cadadio R. BMC Bioinformatics. 2005;6(Suppl 4):S12. [PMC free article] [PubMed]
27. Lekprasert P, Mayhew M, Ohler U. PLoS ONE. 2011;6:e20622. [PMC free article] [PubMed]