In the past few years, many groups have successfully conducted genome-wide RNA interference (RNAi) screens in C. elegans, D. melanogaster and mammals, using either whole animals or cell lines to investigate a full array of biological processes at the systems level [1-4]. Compared with classical genetic screens, such as transposon-mediated mutagenesis and somatic clonal analysis [5-7], RNAi technology is revolutionary in that it allows investigators to quickly interrogate the phenotypic changes that occur upon knocking down individual genes at the genome scale [8]. However, like many other high-throughput technologies, RNAi screens are not flawless. On the one hand, genes may not always be effectively knocked down and will consequently be missed by the screen. We refer to these genes as false negatives (FNs). On the other hand, owing to the tolerance for mismatches and gaps in base-pairing with targets, a small interfering RNA (siRNA) could target up to hundreds of sequences [9, 10]; such unintended targeting is often termed off-target effects (OTEs). OTEs are believed to be the main reason for false positives (FPs) in RNAi screens. The use of long double-stranded RNAs (dsRNAs) in Drosophila has been proposed as a means of reducing the occurrence of OTEs [11]. However, two groups reported that OTEs mediated by short homology stretches within long dsRNAs were prevalent in Drosophila, and that the effectiveness of dsRNAs for reducing OTEs therefore needs further investigation [12, 13]. Furthermore, OTEs and low knockdown efficacy for certain genes are not the only sources of FPs and FNs in RNAi screens. Designing a high-throughput RNAi screen involves many levels of decision-making, such as the type and concentration of RNAi reagents, the readout options, and the methodologies and criteria used for hit selection, each of which could affect the quality of the final results [11]. For example, it has been shown that adopting a better analytic method for hit selection may help reduce the rates of FPs and FNs [14-17].
Both computational and experimental efforts have been made to identify errors in RNAi screens. For example, Ma et al. [12] and Kulkarni et al. [13] suggested that dsRNAs that contained perfect matches of ≥19 nucleotides (nt) to unintended targets, or that had simple tandem repeats of the trinucleotide CAN (where N represents any base), might cause OTEs and thus contribute significantly to FPs. Consequently, sequence-based computational analysis can be used to predict potential FPs in RNAi screens. However, such prediction cannot identify FNs. Moreover, DasGupta and colleagues found that there was no strict correlation between a 19-nt sequence match and FPs, and they suggested that the "FP results" obtained from dsRNAs predicted to have OTEs based on sequence analysis should not be blindly treated as artifacts without further tests [18]. In their study, to experimentally distinguish true positives (TPs) from FPs, they rescreened hits identified in the original screen using multiple, independent "off-target (OT)-free" dsRNAs. However, such experimental validation has its own drawbacks. First, since not all dsRNAs are effective in knocking down their target genes, failure to validate an original positive hit is insufficient to establish it as an FP. In fact, they showed that some known regulators of the pathway under investigation were actually missed by the validation screens [18]. Second, since our knowledge of the mechanisms involved in OTEs is still developing, the successful validation of RNAi hits by so-called "OT-free" dsRNAs might actually be the result of unknown OTEs. Third, validation screens are usually conducted only on the positive hits from primary screens, so FNs cannot be recovered without additional effort.
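The sequence-based criteria described above (a perfect match of at least 19 nt to an unintended target, or tandem CAN repeats) can be illustrated with a minimal Python sketch. It is not the procedure used in the cited studies: the function names are hypothetical, only the given strand is scanned, and the `min_repeats` threshold is an assumed parameter, since the text above does not state how many tandem repeats are required.

```python
import re


def shared_kmers(dsrna, transcript, k=19):
    """Return the k-mers of `dsrna` that occur verbatim in `transcript`."""
    dsrna, transcript = dsrna.upper(), transcript.upper()
    # All length-k windows of the transcript (empty if it is shorter than k).
    target = {transcript[i:i + k] for i in range(len(transcript) - k + 1)}
    return {dsrna[i:i + k] for i in range(len(dsrna) - k + 1)} & target


def has_can_repeats(dsrna, min_repeats=5):
    """True if `dsrna` has at least `min_repeats` consecutive CAN triplets.

    `min_repeats` is an illustrative assumption, not a published cutoff.
    """
    pattern = r"(?:CA[ACGTU]){%d,}" % min_repeats
    return re.search(pattern, dsrna.upper()) is not None


def flag_potential_ote(dsrna, unintended, k=19, min_repeats=5):
    """Summarise both criteria for one dsRNA.

    `unintended` maps transcript names to sequences of genes other than
    the intended target.
    """
    matched = {name for name, seq in unintended.items()
               if shared_kmers(dsrna, seq, k)}
    return {"perfect_match_targets": matched,
            "can_repeats": has_can_repeats(dsrna, min_repeats)}
```

A dsRNA flagged by such a filter is only a candidate FP; as noted above, a 19-nt match does not correlate strictly with FPs, so flagged reagents still call for experimental follow-up.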
As diverse genomic data accumulate, integrating RNAi screening results with other genomic information, particularly information represented in the form of networks, may help in identifying FPs and FNs. Network-based analysis has been widely applied to many biological problems. For example, methods have been developed that use protein-protein interaction networks to predict unknown disease genes [19-22] or to diagnose disease subtypes [20]. A common principle adopted by most of these network-based studies is "guilt by association", i.e., nearby genes in the network are more likely to possess similar functions, or to lead to similar phenotypic changes when perturbed. Here, we test whether this principle holds for RNAi hits and, if it does, we intend to apply it to the noise issue associated with RNAi screens. We also anticipate that network analysis may help to reveal the underlying mechanisms that link the perturbed genes with the observed phenotypic changes, which may not be directly obtainable from the raw screening data. Specifically, if the cell or organism is viewed as a dynamic system composed of interacting functional modules, defined as discrete entities whose functions are separable from those of other modules [23], network information can help us to identify the underlying module structure.
Here we present a comprehensive network analysis using 24 published genome-wide RNAi screens in Drosophila. We first verify the "guilt by association" principle by showing that RNAi hits are significantly more connected to one another than randomly chosen genes. We then develop a network-based RNAi phenotype scoring method, termed NePhe, that integrates information from both network topology and RNAi screening results. We demonstrate the effectiveness of NePhe scores in identifying putative FPs and FNs using a novel rank-based test and two case studies. We show how network information can help identify the underlying modules formed by the refined hits, which potentially explain the phenotypic changes observed in the screens. Finally, we discuss the limitations of our approach and potential follow-up studies.
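To make the connectivity claim concrete, the following Python sketch shows one simple way to ask whether a set of hits shares more network edges than random gene sets of the same size. It is an illustration under stated assumptions, not the test used in this study: the function names are hypothetical, random sets are drawn uniformly without controlling for node degree, and the p-value is a plain empirical estimate.

```python
import random


def edges_within(edges, nodes):
    """Count edges whose two endpoints both belong to `nodes`."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)


def connectivity_pvalue(edges, all_genes, hits, n_perm=1000, seed=0):
    """Empirical p-value for the number of edges among `hits`.

    Random gene sets of the same size as `hits` are drawn uniformly from
    `all_genes` (an assumed null model that ignores node degree).
    """
    rng = random.Random(seed)
    genes = sorted(all_genes)
    size = len(set(hits))
    observed = edges_within(edges, hits)
    # Number of random sets at least as connected as the observed hits.
    as_extreme = sum(
        edges_within(edges, rng.sample(genes, size)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (as_extreme + 1) / (n_perm + 1)
```

A small p-value would indicate that the hits are more densely interconnected than random sets of equal size. A degree-matched null model would be a natural refinement, because highly connected genes can inflate the edge count regardless of phenotype.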