The identification of “hits” or “screening positives” is the goal of any primary RNAi screen, and yet remains a point of considerable contention in data analysis. Hit identification is, essentially, the process of deciding which sample values differ meaningfully from those of the negative controls. While some screeners select a discrete number of top scoring samples as screening positives (often as determined by follow-up capacity), a wide range of hit identification techniques is available. The selected hit list forms the basis for further validation screens or investigations.
To reduce the risk of false positives, many practitioners recommend screening multiple reagents targeting the same gene of interest and selecting hits based on the combined results [6]; generally, genes are chosen as hits when a majority of tested reagents are screening positives, although the Redundant siRNA Activity technique described below offers a more rigorous approach to combining the results of multiple reagents. False positives may also be limited by combining information from multiple screening outputs [22], an approach that has become particularly viable with the advent of high-content screening. While the techniques of multiparametric analysis are complex and beyond the scope of this work, a useful overview is available in Ainscow 2007 [23]. Such approaches may identify real hits that have high variability in a single read-out metric [22].
Below we discuss the features of both small-molecule derived techniques (mean + or − k standard deviations, median + or − k MAD and multiple t-tests) and RNAi techniques (quartile-based selection, SSMD for hit identification, redundant siRNA activity, rank product, and Bayesian models).
Mean + or − k Standard Deviations
This approach, which involves selecting a standard deviation threshold (k) of the normalized data relative to the mean and identifying positives as samples that surpass this threshold, is by far the most frequently used hit identification technique in the RNAi screening literature (e.g., Bard et al 2006 [24], DasGupta et al 2005 [25]). It is often used with z-score normalization, but is sometimes used on data that have been normalized by other approaches (such as the B score). This method is particularly appropriate for normally distributed data because the number of standard deviations from the mean links to an estimate of the probability that hit values are significantly different from the distribution of values for de facto negative controls. Another advantage is that this method is very easy to calculate and implement; however, it is not robust to outliers. Thus, especially for data in which outliers appear frequently, the application of the commonly used 3 standard deviation cut-off with this approach tends to miss weak hits, while lowering the standard deviation threshold to capture such hits may unacceptably increase the rate of false positives [26,27].
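As a minimal sketch of this rule, the following Python function flags samples lying beyond the mean plus or minus k sample standard deviations. The function name and the plate values are hypothetical and chosen only for illustration. The example also shows the sensitivity to outliers noted above: the single extreme value inflates the standard deviation enough that it escapes a 3 standard deviation cut-off on this small plate.

```python
import statistics

def hits_mean_sd(values, k=3.0):
    """Return indices of samples beyond mean +/- k sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    lower, upper = mean - k * sd, mean + k * sd
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical normalized well values; index 4 is an extreme value.
plate = [0.1, -0.2, 0.0, 0.3, 9.0, -0.1, 0.2, -0.3, 0.1, 0.0]

print(hits_mean_sd(plate, k=2.0))  # [4]
print(hits_mean_sd(plate, k=3.0))  # [] : the outlier inflates the SD and masks itself
```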
Median + or − k MAD
An improvement on the mean + or − k standard deviations approach is median + or − k median absolute deviation (MAD) (e.g., Muller et al 2005 [28]). This method is robust to outliers and has been shown to identify weak hits in RNAi data more effectively than mean + or − k standard deviations while still capturing the strong hits and controlling false positives [27]; it has also been shown to generate fewer false negatives than mean + or − k standard deviations when applied to a non-normal data distribution [27], and is also very easy to calculate and implement (although it sacrifices the former method’s easy link to a probability distribution). For these reasons, Chung et al 2008 [27] recommend it as the first-choice approach for hit selection in RNAi screens.
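The robust rule can be sketched in the same way. The function name and data below are hypothetical, and the scale factor 1.4826 (which makes the MAD comparable to a standard deviation for normally distributed data) is a common convention that we assume here rather than one specified in the works cited above.

```python
import statistics

def hits_median_mad(values, k=3.0, scale=1.4826):
    """Return indices of samples beyond median +/- k scaled MADs."""
    med = statistics.median(values)
    # MAD: median of absolute deviations from the median, rescaled (assumed convention)
    mad = scale * statistics.median([abs(v - med) for v in values])
    lower, upper = med - k * mad, med + k * mad
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical normalized well values: a strong hit (index 4) and a weaker one (index 9).
plate = [0.1, -0.2, 0.0, 0.3, 9.0, -0.1, 0.2, -0.3, 0.1, 1.5]

print(hits_median_mad(plate, k=3.0))  # [4, 9]
```

Because the median and MAD are barely affected by the extreme value, the weaker hit at index 9 is still detected at k = 3.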
Multiple t-Tests
For certain assays, such as those comparing RNAi treatment in the presence of drug versus RNAi treatment alone, it may be appropriate to assess the difference in means between replicates for each condition with multiple t-tests (e.g., Whitehurst et al 2007 [29]). This approach is simple to implement and understand, but it requires three or more replicates of each condition and assumes normality of the replicate data. In addition, it is imperative to apply multiple-comparison corrections to the resulting p-values of each individual test if a high false positive rate cannot be tolerated [30], and results of such t-tests are sensitive to outliers [31].
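The correction step can be illustrated with the Benjamini-Hochberg procedure, which is one of several possible multiple-comparison corrections and is our choice for this sketch; the text above does not prescribe a particular method. The p-values are hypothetical stand-ins for the output of per-gene t-tests (which could be obtained, for instance, from `scipy.stats.ttest_ind`).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four per-gene t-tests.
raw = [0.001, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
hits = [i for i, p in enumerate(adj) if p < 0.05]
print(hits)  # [0]
```

Three of the raw p-values fall below 0.05, but only the first remains below that level after adjustment.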
Quartile-Based Selection
Researchers who determine that their data distribution is not symmetrical may wish to employ the quartile-based hit identification method. This approach sets upper and lower hit selection thresholds a chosen number of interquartile ranges above the third quartile and below the first quartile of the data, respectively. Like median + or − k MAD, the quartile method has been shown to identify both strong hits and weak ones while controlling false positives [26]. Although this method is easy to calculate, it has not been generally implemented in the RNAi screening community, perhaps because of its modest improvement over the more common median + or − k MAD approach on approximately normal data. Additionally, in quartile-based selection, as in many other robust methods, the rankings produced are not easily translatable into p-values.
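A minimal sketch of the quartile rule follows, with a hypothetical function name and data. The multiplier c is left as a parameter; the default of 1.7239 is an assumption derived from normal quantiles (for normally distributed data it places the thresholds at roughly 3 standard deviations from the mean) and is not a value given in the text above.

```python
import statistics

def hits_quartile(values, c=1.7239):
    """Return indices of samples below Q1 - c*IQR or above Q3 + c*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - c * iqr, q3 + c * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical normalized well values with a strong and a weaker high-side hit.
plate = [0.1, -0.2, 0.0, 0.3, 9.0, -0.1, 0.2, -0.3, 0.1, 1.5]

print(hits_quartile(plate))  # [4, 9]
```

Because the upper and lower thresholds are anchored to different quartiles, they need not be equidistant from the median, which is what suits the rule to asymmetric data.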
SSMD for Hit Identification
The Strictly Standardized Mean Difference (SSMD) metric discussed earlier can be employed for hit identification by screeners concerned with controlling both the false negative rate (the rate at which siRNAs that have real large or moderate effects fail to be identified as screening positives) and the false positive rate (the rate at which siRNAs that should be considered negative are identified as screening positives) [32]. Formulae are provided [33,34] for calculating the SSMD limits for hit selection based on the desired false positive level, false negative level, or both; while these require a large number of negative controls (> 50), follow-up work [31] provided suggested SSMD cut-offs for screens without large numbers of negative samples, such as confirmatory screens.
While the SSMD metric has a linear relationship to z-score when only one replicate per siRNA is measured in a screen, these statistically based guidelines may make SSMD more meaningfully interpretable to researchers. Currently SSMD-based hit identification is not calculated by standard analysis packages and is not trivial to implement from scratch.
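The linear relationship to the z-score can be seen in a simple sketch. We assume here a method-of-moments style estimate for a single replicate, in which the sample is taken to share the variance of the negative controls and the z-score is computed against those same controls; the published estimators include refinements that are not reproduced here, and the function names and data are hypothetical.

```python
import math
import statistics

def z_vs_negatives(sample_value, negative_controls):
    """z-score of one sample relative to the negative controls."""
    mean_neg = statistics.fmean(negative_controls)
    sd_neg = statistics.stdev(negative_controls)
    return (sample_value - mean_neg) / sd_neg

def ssmd_single_replicate(sample_value, negative_controls):
    """Simplified SSMD estimate for one replicate (assumed equal variances)."""
    mean_neg = statistics.fmean(negative_controls)
    sd_neg = statistics.stdev(negative_controls)
    # Difference of means divided by the SD of the difference of two
    # independent values that share the negative-control variance.
    return (sample_value - mean_neg) / (math.sqrt(2) * sd_neg)

negatives = [1.0, 2.0, 3.0]  # hypothetical negative-control wells
z = z_vs_negatives(5.0, negatives)
ssmd = ssmd_single_replicate(5.0, negatives)
print(z, ssmd)  # under these assumptions ssmd equals z / sqrt(2)
```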
Redundant siRNA Activity (RSA)
The Redundant siRNA Activity (RSA) analysis method [35] is appropriate for researchers seeking to integrate information about multiple RNAi reagents tested for each gene. RSA ranks silencing reagents according to experimental effect and assigns a p-value to all reagents for a single gene based on whether the reagents for that gene are distributed significantly higher in the rankings than would be expected by chance. Because of its use of chance performance as a basis for statistical calculations, RSA is able to provide p-values for gene hits without sacrificing robustness.
Positive reagents identified by this method were found to have higher rates of reconfirmation than those identified with conventional methods, with discrepancies attributable to low reproducibility of orphan individual siRNAs with high activities [35]. While RSA is not currently included in common analysis software packages, its developers have made available implementations in C# (for Windows), R, and Perl (see http://carrier.gnf.org/publications/RSA/).
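The idea of asking whether a gene's reagents sit higher in the ranking than chance would predict can be sketched with a hypergeometric calculation. The code below is a simplified illustration in the spirit of RSA and is not the developers' implementation: it assumes that rank 1 is the most active reagent, scans a cut-off at each of the gene's reagents, and keeps the smallest hypergeometric tail probability. All names are hypothetical, and the minimum over cut-offs should be read as a score rather than a calibrated p-value.

```python
from math import comb

def hypergeom_tail(k, total, gene_size, cutoff):
    """P(at least k of a gene's reagents fall in the top `cutoff` ranks by chance)."""
    upper = min(gene_size, cutoff)
    favourable = sum(
        comb(gene_size, x) * comb(total - gene_size, cutoff - x)
        for x in range(k, upper + 1)
    )
    return favourable / comb(total, cutoff)

def rsa_like_score(gene_ranks, total_reagents):
    """Smallest tail probability over cut-offs placed at each of the gene's reagents."""
    ranks = sorted(gene_ranks)  # rank 1 = most active reagent (assumed)
    return min(
        hypergeom_tail(k, total_reagents, len(ranks), r)
        for k, r in enumerate(ranks, start=1)
    )

# Hypothetical screen of 10 reagents, two reagents per gene.
clustered = rsa_like_score([1, 2], total_reagents=10)   # both reagents at the top
scattered = rsa_like_score([5, 10], total_reagents=10)  # reagents spread out
print(clustered < scattered)  # True
```

A gene whose two reagents occupy ranks 1 and 2 scores 1/45 here, whereas a gene with a single highly ranked reagent and one inactive reagent would score less favourably, which mirrors the treatment of orphan siRNAs described above.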
Rank Product
Screeners intending to perform screens in biological replicate and seeking a robust hit identification approach that provides estimated p-values may also wish to consider the Rank-Product method, originally developed for use with microarray data [36]. The premise of the Rank-Product approach is that a consistent hit should be highly ranked in each independent biological replicate set. The rank-product statistic for each sample across all independent sets estimates this consistency; it can then be translated into a measure of statistical significance by comparing the observed rank-product statistic to the rank-product statistics obtained from a large number of simulated data sets (providing the statistic expected by chance).
This approach provides p-values for potential hits without requiring the assumption of an underlying probability distribution, but does require significant computation and several replicates per screen to work. While similar to RSA in its comparison of true data rankings to those produced by chance, it does not depend on the use of multiple different RNAi reagents per gene. A Rank-Product implementation suited for use with RNAi screening data has recently been made available as part of the RNAither package [37] in the Bioconductor open-source bioinformatics software.
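The procedure described above can be sketched as follows. This is an illustrative permutation scheme under stated assumptions, not the RNAither implementation: larger values are assumed to indicate stronger hits (rank 1), the plain product of ranks is used in place of its geometric mean (the two give the same ordering), and the null distribution is simulated by drawing independent random rank permutations for each replicate. Function names and data are hypothetical.

```python
import math
import random

def rank_products(replicates):
    """Product of each sample's ranks across replicates (rank 1 = largest value)."""
    n = len(replicates[0])
    products = [1] * n
    for rep in replicates:
        order = sorted(range(n), key=lambda i: rep[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            products[i] *= rank
    return products

def rank_product_pvalues(replicates, n_perm=2000, seed=0):
    """Estimate p-values as the fraction of simulated rank products <= observed."""
    rng = random.Random(seed)
    n, k = len(replicates[0]), len(replicates)
    observed = rank_products(replicates)
    ranks = list(range(1, n + 1))
    null = []
    for _ in range(n_perm):
        # Under the null, each replicate's ranking is an independent random permutation.
        perms = [rng.sample(ranks, n) for _ in range(k)]
        null.extend(math.prod(p[i] for p in perms) for i in range(n))
    return [sum(rp <= obs for rp in null) / len(null) for obs in observed]

# Three hypothetical biological replicates of six samples; sample 0 is consistently top.
reps = [
    [9.1, 1.2, 0.8, 2.0, 1.1, 0.5],
    [8.7, 0.9, 1.5, 1.0, 2.2, 0.4],
    [9.5, 2.1, 0.7, 1.3, 0.6, 1.0],
]
rp = rank_products(reps)
pvals = rank_product_pvalues(reps)
print(rp)  # [1, 30, 75, 24, 48, 144]
```

The sample ranked first in every replicate receives the smallest estimated p-value, since only a simulated sample that is also ranked first three times can match its rank product.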
Bayesian Models
Screeners with appropriate computational resources who seek explicit estimated probabilities that a given siRNA has no effect, an inhibition effect, or an activation effect (rather than the single score produced by other methods) may wish to employ a Bayesian approach described recently by Zhang et al 2008 [38]. Bayesian statistics use Bayes’ Theorem to calculate the probability that a particular hypothesis is true given the observed evidence, and offer a means to update these probabilities when additional evidence is collected. Zhang et al identify three hypotheses of interest (an siRNA has no effect, an siRNA has an activation effect, or an siRNA has an inhibition effect) and develop two models to describe the posterior probability that each of these hypotheses is true for a particular sample given the evidence of this sample’s observed value. The first, and simpler, model is based on using only the negative controls to describe the posterior distribution of the true mean value for the sample given the observed data value. The second, more complex model describes a posterior distribution that assumes the availability of data from both positive-inhibition and positive-activation controls as well as negative controls. Both models also provide the means to calculate the false discovery rate associated with any given hit threshold, but are usable only on screens done in singlicate.
A strength of this approach is that it incorporates both plate-wide and experiment-wide information as well as (depending on the model used) information from both negative controls and the assumed de facto negative samples. When several hit identification approaches were compared, Zhang et al found the simpler Bayesian model to perform best, followed by plate-wise median + or − k MAD. Unfortunately, although the published Bayesian models show great promise, they have not yet been incorporated into commonly available analysis software and are not trivial to implement. Until software applications implementing Bayesian modeling are available, the plate-wise median + or − k MAD approach may be the best alternative.
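To make the Bayesian reasoning in this section concrete, the following toy example applies Bayes' Theorem to the three hypotheses for a single observed value. It is a generic illustration only and does not reproduce either of the models of Zhang et al: the normal likelihoods, the effect sizes, and the prior probabilities are all assumptions invented for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(x, priors, means, sd=1.0):
    """Posterior probability of each hypothesis given one observed value x."""
    weighted = {h: priors[h] * normal_pdf(x, means[h], sd) for h in priors}
    total = sum(weighted.values())  # normalizing constant from Bayes' Theorem
    return {h: w / total for h, w in weighted.items()}

# Assumed priors: most siRNAs have no effect.
priors = {"none": 0.90, "inhibition": 0.05, "activation": 0.05}
# Assumed mean normalized value under each hypothesis.
means = {"none": 0.0, "inhibition": -3.0, "activation": 3.0}

strong = posterior(-3.0, priors, means)  # value typical of an inhibitor
null_like = posterior(0.0, priors, means)  # value typical of no effect
print(max(strong, key=strong.get))        # inhibition
print(max(null_like, key=null_like.get))  # none
```

Unlike a single score, the output assigns a probability to each of the three hypotheses, and a hit threshold can be stated directly in terms of the posterior probability of no effect.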