The work presented here is the first published study to experimentally define the factors that influence reproducibility between genome-scale siRNA screens. The high throughput loss-of-function genomic screening community has reached a critical milestone in the application of this promising technology. Multiple teams have completed whole genome siRNA-based screens for factors involved in the same biological system, and the obvious first reaction was to compare the overlap between related screens.2,15,28
This seemingly straightforward task resulted in a marked lack of overlap among the published hit lists. A comprehensive meta-analysis attempted to reconcile the data from these substantially different assay systems, and did produce some additional overlap of gene families.4
Without a clear idea of the reproducibility expected, interpretation of the limited overlap was clouded. Our study attempts to shed light on the issue of reproducibility of genome-scale siRNA screening and provide a context for interpretation of published screening data.
Within six months, two human whole genome siRNA-based screens were completed. The genomic screens rigorously adhered to the same protocols, utilized the same instrumentation, used the same batch of virus and were performed by the same team. The only intentional difference was the four months separating the completion of GS1 and beginning of GS2. Variables that were not accounted for included changes in the batches of reagents and the passages of the cell line.
The 2 × 2 pooled library format described in ensured that hit identification would require that at least two of four siRNAs induce a phenotype.7
If both positive siRNAs resided in one well, the hit would be missed. The probability of such a distribution by random chance is 0.18, or 18% of the possible scored wells.
Off-target effects (OTEs) are defined as changes in gene expression for genes not intentionally targeted by the siRNA design. Jackson et al. 2003 demonstrated that decreasing the effective concentration of siRNA targeting MAPK14 decreased OTEs while preserving the integrity of the target-specific knockdown. They also noted that some OTEs could not be titrated away. Our assay conditions were designed to minimize OTEs by utilizing a relatively low effective siRNA concentration of 15.4 nM (7.7 nM for each siRNA) when compared to other published screens (Supplemental Table 2).
We addressed the impact of screening format on OTEs as well as reproducibility by examining the effect of pooled siRNA duplexes on cell density. In this case, effects on cell density can be considered unintended effects, or OTEs. The scatter plots demonstrate the remarkably reproducible effect a given pool of two siRNAs had on cell density, while independent pools had entirely independent effects on cell density. Although we are unable to rule out that the observed effect could be due to influences such as relying on a single source (Qiagen) for the design and manufacturing of the siRNA library, these data demonstrate that a given pool of siRNAs will produce similar results when tested multiple times. The rationale that multiple tests of a single system better illustrate the true range of the system has sound statistical support. However, contrasting the cell density results in which the same pool of siRNAs is tested twice with the population distributions generated by testing independent pools clearly shows that the function of a target mRNA cannot be defined experimentally without independent tests. We understand that additional validation of our hit list is required; however, by identifying those targets which display the appropriate phenotype with two independent pools, we fulfill the important criterion of redundancy in the primary screen.
Recently, Brass et al. 2009 published a screen that identified factors influencing H1N1 propagation in a human model cell line; the screen utilized three tests of the same pool of four siRNA duplexes. This study reported 334 targets from the primary screen, and a follow-up screen to deconvolute the pools resulted in 40% of the 334 putative hits being confirmed by at least two independent siRNAs. As a contrast, consider the Zhang et al. 2009 screen for modifiers of the circadian cycle. The authors used the pooled 2 × 2 format and followed up on hits that scored well in both independent tests. In a validation screen, 78% of the 343 putative hits reconfirmed with two or more siRNAs eliciting a strong effect. Despite screening the genome three times, representing a 50% increase in workload relative to Zhang et al. 2009, Brass et al. 2009 did not improve the resolution of mRNA function.
Our study used the data from each of the four 74-plate sets (GS1AB, GS1CD, GS2AB, and GS2CD) as batches for analysis for several reasons. On each 384-well plate, four negative siGFP and four positive si-vATPase controls were arrayed. Both the SSMD and z-score methods require a comparison to a negative control set, so the minimum dataset for analysis must be one assay plate. Zhang et al. 2008 demonstrated using SSMD how variability in control wells contributed strongly to the calculated assay performance. This indicated that combining more plates into a single dataset could mask the impact that normal variation in the negative control set had on the analysis. Further, Qiagen manufactured their library in a systematic fashion. Consequently, 65% of the siRNA duplexes targeting the GPCR family members were arrayed on a single plate, and the remaining ones were distributed on two adjacent plates. Other gene families are arrayed similarly. SR utilized the population to determine a hit, so any population with significant bias would inhibit the application of SR; thus we chose larger datasets to provide protection from plating bias. qualitatively indicated that data within each screen performed consistently, while the calculated Z’-factor quantitatively indicated that any variability present was unlikely to negatively impact the resolution of hits as strong as the si-vATPase. Although we concluded that treating all wells in a single siRNA set as one dataset would be appropriate, one must carefully consider which approach best suits the behavior of the data sets.
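As an illustration of how control-well variability feeds into assay quality metrics, the following minimal sketch computes the Z’-factor and SSMD from one plate's controls using their standard textbook definitions. The well values are hypothetical, the function names are ours, and the use of the sample standard deviation is an assumption; this is not the analysis code used in the study.

```python
from statistics import mean, stdev

def z_prime_factor(pos, neg):
    """Z'-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))

def ssmd(pos, neg):
    """SSMD between two independent groups:
    (mean_pos - mean_neg) / sqrt(var_pos + var_neg)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) ** 2 + stdev(neg) ** 2) ** 0.5

# Hypothetical readouts for the four negative (siGFP) and four positive
# (si-vATPase) control wells of one plate; values are invented.
neg_wells = [100.0, 96.0, 104.0, 100.0]
pos_wells = [10.0, 12.0, 8.0, 10.0]

zp = z_prime_factor(pos_wells, neg_wells)
s = ssmd(pos_wells, neg_wells)
print(f"Z'-factor = {zp:.3f}, SSMD = {s:.2f}")
```

With only four wells per control, a single outlying well shifts both the mean and the standard deviation noticeably, which is one way to see why the choice of dataset size matters.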
Genomic screening is costly. Significant resources are invested during assay development to determine how best to pursue factors involved in a biological system. It seems contradictory that the investment in the cell-based assay has not been complemented by an investment in understanding the interpretation of screening results. We measured the reproducibility between two genomic screens by directly testing how much overlap there was between specific analysis methods. Applying the statistical methods SR, MAD, z-score and SSMD by following the recommendations of each method’s authors produced a broad range of overlap.6,20,25,27
It would seem in some cases that there was little overlap between identical screens. However, the lengths of the hit lists were dramatically different. For example, in the comparison of SSMD hits between GS1 and GS2, the potential overlap could include all 513 hits from GS1 but at most 45% of the 1,140 hits in GS2. Further, it was difficult to assess the reproducibility between hit lists within a single screen because the different analytical methods produced lists of significantly different lengths.
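The ceiling on overlap imposed by unequal list lengths can be checked with a few lines of arithmetic. This is a minimal sketch using the two hit counts quoted above; the function name is ours.

```python
def max_overlap_fraction(len_a, len_b):
    """Largest possible shared fraction of each list, given only their lengths.

    At most min(len_a, len_b) hits can be shared, so the longer list
    can never be fully covered by the shorter one.
    """
    shared_max = min(len_a, len_b)
    return shared_max / len_a, shared_max / len_b

# SSMD hit counts quoted in the text: 513 in GS1, 1,140 in GS2.
frac_gs1, frac_gs2 = max_overlap_fraction(513, 1140)
print(f"GS1: up to {frac_gs1:.0%} shared; GS2: up to {frac_gs2:.0%} shared")
```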
In order to compare methodologies within and between genomic screens, we chose to compare lists composed of the same number of hits. The top 200 hits for any method produced between 39% and 49% overlap. Expanding the list to include the top 500 hits did not improve the apparent reproducibility, as the range was 32% to 41% overlap. The best indication that genomic data are regularly reproducible was established by comparing the top 200 from any individual method to the top 500 from the alternate screen. This situation reflects the stochastic nature of biological assays. A subset of siRNA targets will always score particularly well because the siRNAs are robust, the gene is easily silenced, or the gene is simply at a critical juncture in a system or pathway. Other siRNAs that score strongly in one assay but only moderately in a second assay demonstrate that some pathways may be more resistant or adaptable to change, or that some genes are not as efficiently silenced as others. Thus, a gene that strongly inhibited infection in one screen did not necessarily strongly inhibit infection in the other screen, but it was highly likely to perform well.
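The equal-length and asymmetric comparisons described above can be sketched as follows. The gene identifiers and rankings are invented for illustration, and the function name is ours; this is one straightforward way to compute such an overlap, not necessarily the procedure used in the study.

```python
def top_n_overlap(ranked_a, ranked_b, n_a, n_b):
    """Fraction of the top n_a entries of ranked_a that also appear
    in the top n_b entries of ranked_b (lists ordered best hit first)."""
    top_a = ranked_a[:n_a]
    if not top_a:
        raise ValueError("top list from ranked_a is empty")
    top_b = set(ranked_b[:n_b])
    found = sum(1 for gene in top_a if gene in top_b)
    return found / len(top_a)

# Hypothetical ranked hit lists from two screens (strongest hit first).
screen1 = ["g1", "g2", "g3", "g4", "g5", "g6"]
screen2 = ["g2", "g7", "g1", "g8", "g3", "g4"]

symmetric = top_n_overlap(screen1, screen2, 2, 2)   # top 2 vs top 2
asymmetric = top_n_overlap(screen1, screen2, 2, 5)  # top 2 vs top 5
print(symmetric, asymmetric)
```

In the toy data, one of the two strongest hits from the first screen ranks just outside the top two in the second screen, so widening only the comparison list recovers it. This mirrors the top 200 versus top 500 comparison.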
As a standard practice, when comparing genomic screens one must consider the behavior of assay wells which did not make the top-tier hit lists. Unlike other –omic technologies such as microarray analysis of gene expression, proteomic studies of protein abundance and genotyping via deep sequencing, the results of siRNA loss-of-function studies are not directly quantitative. The relative strength of the “score” in the phenotypic assay may not be directly related to the abundance of a required target protein. Due to differences in effective concentration and the stoichiometry of the reactions involved, the strength of a complex phenotypic score is very unlikely to scale proportionally with protein levels. A target that scores only moderately in a screen may be absolutely required; however, due to factors such as siRNA efficiency, protein half-life and message abundance, the protein levels may only be reduced 30% in the course of the assay. This makes generation of “hit lists” a troubling facet of siRNA screening and certainly contributes to the limited overlap and reproducibility between screens. In order to alleviate these issues, the authors of screens should make available the performance of all the wells in a genomic screen at the time of publication, similar to what is currently done for microarrays.
The dual genomic screens demonstrated reproducibility that exceeded 67%. This is the strongest published overlap for two complete screens. The 2 × 2 pooled siRNA library format and low siRNA concentration provided an efficient assay design to identify strong candidates without costly validation screening. We demonstrated that each siRNA pool behaves quite reproducibly with respect to VOC, and posit that independent siRNA pools tested against the same system would provide more robust final data sets. The best practice we can promote at this time is that researchers use several analysis strategies, and that all relevant data from each control and each experimental well be provided to future researchers, as is the case for microarray and genome-wide association studies.