In this article, we described RIPSeeker, an HMM-based R software package specifically tailored to analyze RIP-seq data with statistical rigor. As a proof-of-concept, we first tested RIPSeeker’s performance on the simulated data generated from a two-state HMM with known NB parameters and observed, on average, 85–100% accuracy (Supplementary Figure S1
). To demonstrate the utility of the software package in the real-world application, we made use of three independent RIP-seq datasets and two PAR-CLIP datasets, including, in total, 12 sample comparisons corresponding to six distinct proteins (RIP-seq: PRC2, CCNT1, ELAVL1 and PABPC1; PAR-CLIP: PUM2 and QKI). As comparisons, we applied to the same datasets, six state-of-the-art algorithms, including three ChIP-seq algorithms (MACS, QuEST and HPeak), two RNA-seq methods (Cuffdiff and Rulebased) and one PAR-CLIP program (PARalyzer) (). We also tested DESeq (as the third RNA-seq strategy) on these datasets and decided not to include it in the subsequent comparisons due to the small number of significant transcripts identified by the method (detailed in Supplementary Methods
Section Bioconductor package DESeq settings and results). Based on the pairwise and multi-way overlap analyses (Supplementary Figures S7–S9
), RIPSeeker not only has generally good agreements (~50% on average) with the ChIP-seq or PAR-CLIP algorithms in their predictions on the 12 sample comparisons, but also demonstrated its robustness in the consistent predictions using distinct negative controls (T7-tag and the RIP RNA input) for the same RIP treatments in the same cell line corresponding to the protein ELAVL1 or PABPC1 ().
The observed good agreements among most of the tested methods prompted us for a more rigorous and quantitative comparison based on AUC derived from the ROC plots (Section Receiver operating curve). Due to the insufficient canonical transcripts (‘gold-standard’) known to associate with the six available proteins, we constructed for each of the 12 sample comparisons, a list of confidence peaks overlapped by peaks from the majority (at least half) of the tested methods and used such a list to benchmark each method based on their sensitivity and specificity as functions of decreasing score-specific cutoff in discriminating the confidence peaks from other peaks within their own predictions. The resulting AUCs from the ROCs were then used to evaluate and compare the performances of the comparison methods on each sample. RIPSeeker demonstrated superior performances by having the highest AUC, averaging >80% in 9 of the 12 tests (). The results from these unbiased analyses not only consistently favor the proposed RIP-seq program but also suggest a sensible way for the users to prioritize RIPSeeker predictions based on the corresponding statistical confidence.
Some explanations are needed for some methods falling below the diagonal line of the ROC plots in some tests, which may seem worse than random. As the true-negatives TN are unknown, we used as proxy, the predictions from each method that are not
in the consensus peak list (defined as the peaks consistently agreed on by a majority of the methods). Thus, the ROC is relative
to each method, with the TP and TN defined as the number of predictions from that method that have overlap and no overlap with the consensus peak list, respectively. Consequently, the TP and TN are not necessarily equal. Thus, a method can have most of its peaks not in the consensus set, leading to TN much greater than TP. Moreover, if the scores from that method do not properly favor the minority of the TP, then we will observe a ROC lying at the lower triangle portion of the plot leading to <50% AUC, which can be seen with the Rulebased method and several other methods in some panels (e.g. b–d). Conceptually, the peak callers here can be considered as ‘unsupervised classifiers’ that call peaks directly from the genome. To some extent, a completely insensible method would be equivalent to randomly sampling genomic regions from the genome. Thus, the ‘random peaks’ will unlikely to have any overlap with the peaks from other sensible methods. As a result, ROC corresponding to that method will have a flat curve along the x
-axis (FPR), resulting in a zero AUC. Thus, given the large search space of a mammalian genome, a method having an AUC ~0.5 in the current comparison is actually much better than random. If the test itself were insensible, especially when the comparison methods generally disagree, then all of the methods will have AUC ~50% or less, which is not
the case because we observe good pairwise and multi-way overlaps among the comparison methods in all of the 12 tests (Supplementary Figures S7–S9
), and there are more than one method in each panel having AUC much higher than 50% ().
Furthermore, we demonstrated at the genome scale that the peaks from RIPSeeker and other comparison methods such as QuEST and HPeak that performed competitively in the ROC evaluations are biologically meaningful. In particular, the peaks for four of the six proteins, namely, ELAVL1, PABPC1, PUM2 and QKI, are significantly enriched for genomic elements, implicating their functions suggested in the literature (Supplementary Figure S11
). Moreover, the top 5000 peaks from RIPSeeker for PUM2 and QKI are the most significantly enriched for the previous published motifs. Finally, RIPSeeker demonstrates its sensitivity by identifying the canonical PRC2 and CCNT1 interacting ncRNAs with high statistical confidence and peak length close to the natural length of the lncRNAs (Supplementary Figure S19
In terms of usability, the front-end main function ripSeek is sufficient for most applications. The function takes as the only required argument the path to alignment files (BAM/BED/SAM) and outputs predicted RIP regions. Optionally, user may indicate through cNAME which among the input file(s) is (are) control to enable eFDR calculation. If the arguments biomaRt_dataset and/or goAnno are set, ripSeek will return the annotated RIP predictions and the enriched GO terms, respectively. RIPSeeker also supports paired-end read alignments. However, there are currently no paired-end RIP-seq data available. For the interest of space, many other features such as paired-end support (using RNA-seq data), visualization of read coverage (Supplementary Figures S3
), GO enrichments (Supplementary Tables S2
) and programmatic access to the UCSC genome browser for visualization (Supplementary Figure S20
) are not demonstrated in the main text. For more details, please refer to the R documentation and vignette that come with the package. Moreover, the mixture NB and HMM functions in RIPSeeker package are implemented as general purpose function, which can be used to model any sequential count data with arbitrary number of hidden states. In fact, most functions provided in RIPSeeker can be used as standalone functions, allowing flexible customization to suit user’s own workflow. For instance, user can choose to run HMM functions followed by disambiguating multihits to just save the alignment output as GappedAlignments object for other analyses within R. As another example, user could estimate the known transcript or gene expression in RPKM or FPKM (fragments per kilobase of million mapped reads for paired-end reads) from the RNA-seq data using computeRPKM function. Together, RIPSeeker serves as a bioinformatics suite for various computations.
Because the RNA–protein interaction may arise from associations generated after cell lysis, further experimental validation is required to confirm that the protein indeed interacts with the predicted transcript in vivo
). Ideally, validating functional association can provide strong support to the physical interaction observed from the RIP-seq analyses. For instance, based on their RIP-seq results, Zhao et al.
) validated the interaction between PRC2 and Meg3
) through RNAi knockdown and over-expression experiments to support their hypothesis that Meg3
recruits PRC2 to repress the upstream imprint gene Dlk1
Biological replicates are an important addition to the RIP-seq analysis to further filter out false-positives. Currently, we only provide a helper function to combine peaks identified separately from the biological replicates. As future works, RIPSeeker will incorporate biological replicates into the framework in fitting the HMM model and in hypothesis testing taking into the sample variance. Additionally, the peaks obtained by RIPSeeker may be further trimmed up to where the alignments occur and end within that region to refine the peak resolution, which facilitates accurate primer design for experimental validation. Another useful future addition will be to add an input option for RNA-seq alignments assumed to come from the same sample desirably under similar conditions as the RIP-seq experiment. In that case, RIPSeeker will weigh the peaks based on the model trained on RNA-seq data assuming a positive correlation between the RIP-seq and RNA-seq signals. Finally, the parallel computing option speeds up the computation by a factor proportional to the total number of CPU cores but may impose larger memory overhead than the singe-threading approach. Performance optimization is needed to minimize memory trace.
In perspective, RIP-seq analysis provides information for transcripts that physically interact with a regulatory protein. As the ENCODE data recently become available (1
), it is possible to correlate the RNA mediators from RIP-seq results with the known gene targets implicated in the ChIP-seq data based on their co-expressions measured by RNA-seq across experimental conditions. With analytical frameworks currently under active development for sequencing data, an integrative approach as such is at the horizon to elucidate the global RNA regulatory network.