Several viral shRNA library methods have been described and have become valuable tools for conducting RNAi screens (reviewed in [1–3]). Microarray synthesis of shRNAs has been used to generate diverse libraries of shRNAs or microRNA-designed shRNAs, which are then cloned, sequence-verified, and arrayed in 96-well plate format [2, 4, 5]. To simplify screening, these barcoded shRNA constructs can be used as pools, and the resulting hits identified by recovering the barcodes and hybridizing them to a microarray [6–9]. These strategies have been widely and successfully implemented [6–9], and such libraries are commercially available (e.g., Open Biosystems, Sigma, TRC libraries).
A central shortcoming of existing lentiviral libraries is their low diversity (typically 3–5 shRNAs/gene), which results in high rates of both false negatives and false positives. False negatives occur because currently available algorithms (e.g., [10]) cannot ensure the presence of effective hairpins specific to a target gene. For the same reason, false positives become problematic because more than one effective shRNA/target must be present to rule out off-target effects. The use of high-diversity libraries would allow identification of multiple potent shRNAs/target (a critical control in RNAi experiments [11]) while increasing the sensitivity of RNAi screens.
Existing shRNA libraries have typically required extensive cloning, sequencing, and often the addition of a vector-specific “barcode” sequence to each shRNA before it is included in the library, a time-consuming and costly process. Recent improvements in microarray-based oligonucleotide synthesis allow the production of long oligos (>100 bp) with an error rate of less than 1/250 bp (data not shown). These oligos should permit a direct clone-and-use strategy that makes it easy to adopt changes in RNAi technology, vector choice, or assay design.
Accordingly, we generated pooled shRNA libraries that (1) have highly expanded per-gene coverage (~30 shRNAs/gene), (2) are easy to construct and inexpensive to screen, (3) can be used directly as shRNA pools, and (4) can be readily quantitated by microarray and deep sequencing following a screen. To this end, we designed a pilot library encoding 22,000 shRNAs to target ~600 genes, including nearly all known human CD antigens (CD antigen shRNA library). To maintain the high fidelity and diversity of the input oligo library, we carefully optimized conditions for PCR amplification, cloning, and propagation of the shRNA library (see Methods and Supplementary Note 1).
To determine the mutation frequency in the shRNA library, we sequenced 122 random shRNA inserts. Of these clones, 64% were correct (Supplementary Table 1). The errors in the remainder generally consisted of 1–2 nucleotide mutations or deletions. While it is likely that many of the imperfect shRNA sequences retain effectiveness in gene knockdown [12–14], these mutants can be identified by deep sequencing and removed from downstream analysis. We repeatedly created CD antigen shRNA libraries with 60–80% correct shRNA sequences (Supplementary Table 1).
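As a rough consistency check on these figures, the fraction of error-free inserts expected from synthesis errors alone can be estimated from the per-base error rate. The sketch below is only illustrative: it assumes independent errors at each base, uses the upper-bound rate of 1/250 per bp quoted for the oligo synthesis, and takes an oligo length of 100 bp as an assumption (the text states only that the oligos exceed 100 bp). Errors introduced during PCR or cloning are ignored.

```python
def expected_perfect_fraction(per_base_error_rate: float, length_bp: int) -> float:
    """Probability that an oligo of length_bp carries no error,
    assuming errors occur independently at each base."""
    return (1.0 - per_base_error_rate) ** length_bp


# Assumed inputs: 1/250 per bp is the upper bound given in the text;
# 100 bp is an illustrative length, not a value reported by the authors.
fraction = expected_perfect_fraction(1 / 250, 100)
print(f"{fraction:.2f}")  # 0.67
```

Under these assumptions roughly two thirds of inserts would be perfect, which is in line with the 60–80% observed by clone sequencing.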
PCR amplification can lead to a reduction in complexity of an shRNA mixture during multiple cycling steps. To assess the library complexity, we deep sequenced PCR-amplified shRNAs and identified ~95% of the expected shRNAs (Supplementary Fig. 1), with error rates consistent with our previous single-clone measurements (Supplementary Table 2). It is also possible to monitor these libraries using microarrays and improved half-hairpin probes [7–9]. Either approach would allow the direct identification of shRNAs and eliminate the need to independently barcode each vector.
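The idea of identifying shRNAs directly from their sequence, rather than from a barcode, can be sketched as a lookup of each read against the designed library. The example below is a minimal illustration, not the authors' pipeline: it assumes reads have already been trimmed to the shRNA region, accepts exact matches only, and uses made-up sequences and identifiers.

```python
from collections import Counter


def count_exact_matches(reads, reference):
    """Count reads that exactly match a designed shRNA sequence.

    reads: iterable of read strings (assumed already trimmed).
    reference: dict mapping designed sequence -> shRNA identifier.
    Returns (Counter of reads per shRNA id, number of non-matching reads).
    """
    counts = Counter()
    discarded = 0
    for read in reads:
        shrna_id = reference.get(read)
        if shrna_id is None:
            discarded += 1  # mutant or unrelated read, excluded from analysis
        else:
            counts[shrna_id] += 1
    return counts, discarded


# Hypothetical toy library and reads, for illustration only.
reference = {"ACGTACGT": "shA", "TTGCAAGC": "shB", "GGATCCAA": "shC"}
reads = ["ACGTACGT", "ACGTACGT", "TTGCAAGC", "ACGTACGA"]

counts, discarded = count_exact_matches(reads, reference)
coverage = len(counts) / len(reference)  # fraction of designed shRNAs observed
```

Here `coverage` plays the role of the fraction of expected shRNAs recovered, and `discarded` counts the reads that would be dropped as mutant sequences.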
Since we obtained a reasonably low error rate and nearly complete shRNA coverage, it seemed likely that the libraries could be used directly in RNAi experiments. As a functional test, we infected human Raji B cells with the CD antigen shRNA library, and sorted infected cells (expressing mCherry as a marker for infection) that also displayed reduced expression of CD45. The initial comparison of cells seven days after infection with either the control virus or the shRNA library showed no notable difference during the first sort. However, after two rounds of sorting with cells cultured for seven days between sorts, we observed a substantial enrichment for mCherry+ CD45low cells as compared to cells infected with vector alone (35.9% vs. 9.62%). To identify active anti-CD45 shRNAs, we PCR-amplified and cloned the lentiviral shRNA inserts from genomic DNA isolated from the CD45low sorted cells. Of 83 sequenced shRNA clones, 39 targeted CD45 (47%). Given that the starting population of the library contained 0.15% CD45-targeting shRNAs (33 out of 22,000), this represents an enrichment of >300-fold accomplished in a two-step sort procedure.
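The enrichment figure follows directly from the clone counts given above. A short worked calculation, using only the numbers in the text (the function name is ours):

```python
def fold_enrichment(hits_after, total_after, hits_before, total_before):
    """Ratio of the hit fraction after selection to the hit fraction before."""
    fraction_after = hits_after / total_after
    fraction_before = hits_before / total_before
    return fraction_after / fraction_before


# 39 of 83 sequenced clones targeted CD45 after sorting;
# 33 of 22,000 library shRNAs target CD45 before sorting.
enrichment = fold_enrichment(39, 83, 33, 22000)
print(f"{enrichment:.0f}-fold")  # 313-fold
```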
Although other shRNAs were detected in the enriched fraction, we detected multiple distinct shRNAs only for CD45, highlighting the power of the expanded library to unambiguously detect ‘hits’ in a single screen. In this initial experiment we recovered 6 unique CD45-targeting shRNAs in the sorted CD45low fraction from the total of 33 CD45-targeting shRNAs present in the library. These active shRNAs differed markedly in abundance within the sorted fraction (Supplementary Table 3). In general, the shRNAs that were recovered in larger numbers were also more potent when they were individually re-tested for CD45 knockdown (Supplementary Table 3, Fig. 2). Therefore, the expanded library yielded a diverse population of target-specific shRNAs in a single experiment, and allowed us to obtain information about the potency of each individual shRNA from the relative number of each shRNA recovered.
Recent developments in deep sequencing technology make it possible to simultaneously measure the presence of > 80 million distinct sequences at a typical read length of ~50–70 nucleotides, which is ideally suited to monitor the shRNA sequences in the library described here. A digital readout allows the clear resolution of even very similar shRNA species, which is important for the determination of efficacy of individual shRNAs in high complexity libraries. In addition, mutant shRNAs can be directly detected and discarded from further analysis, which is not possible using microarray hybridization assays. This should be particularly useful when measuring the loss of shRNAs that cause cell death or slow growth (dropout screens), where the presence of inactive mutant shRNAs would complicate the interpretation of results. Finally, the large capacity of deep sequencing allows the detection of subtle changes in abundance within genome-scale populations of shRNAs.
To evaluate the capacity of deep sequencing to accurately measure sequence abundance over a broad range of concentrations, we performed a dilution series with a known set of 32 oligonucleotides with unique 28-mer sequence tags. Oligonucleotide counts were highly linear with input over an ~1 × 10⁶-fold concentration range (Supplementary Fig. 2). Sequences that were read fewer than ~10 times were measured less reliably, but they can be measured accurately simply by increasing the sequencing coverage.
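One way to quantify linearity in a dilution series of this kind is to fit read counts against input concentration on a log-log scale: a slope near 1 with a correlation near 1 indicates that counts scale proportionally with input. The sketch below is an assumption about how such a check could be done, not the analysis used for Supplementary Fig. 2, and the data are invented for illustration. The `min_reads` cutoff reflects the observation that sequences read fewer than about 10 times are less reliable.

```python
import math


def log_log_fit(concentrations, counts, min_reads=10):
    """Least-squares slope and Pearson r of log10(count) vs log10(concentration).

    Points with fewer than min_reads reads are excluded as unreliable.
    """
    pairs = [(c, n) for c, n in zip(concentrations, counts) if n >= min_reads]
    xs = [math.log10(c) for c, _ in pairs]
    ys = [math.log10(n) for _, n in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical dilution series spanning six orders of magnitude.
concentrations = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
counts = [12, 95, 1_100, 9_800, 102_000, 990_000, 10_100_000]

slope, r = log_log_fit(concentrations, counts)
```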
The large dynamic range and linearity of this counting approach suggested that deep sequencing could be used to measure the change in abundance of shRNA species at early time points in our test screen. To this end, we used deep sequencing coupled with binned, flow cytometry-based sorting to search for shRNAs targeting CD45. Human Raji B cells were infected with the CD antigen shRNA library and grown for 1 week, after which they were sorted into 6 fractions representing different levels of CD45 expression (Supplementary Fig. 3). Genomic DNA was prepared from the sorted fractions, shRNAs were amplified by PCR, and the abundance of each shRNA in the various fractions was assessed by deep sequencing. Remarkably, even though a population of CD45low cells was undetectable by flow cytometry at this early time point, we could readily measure substantial enrichment of multiple active anti-CD45 shRNAs in the CD45low fractions. This included all anti-CD45 shRNAs identified in the previous experiment, which involved two rounds of highly selective sorting performed over several weeks. Importantly, although the shRNAs were present at unequal levels at the beginning of the experiment, normalization of their abundance in CD45low fractions to a fraction with high CD45 expression could clearly identify enrichment for multiple active anti-CD45 shRNAs.
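The normalization described here can be illustrated as a per-shRNA ratio of read fractions between a low-expression and a high-expression sort bin, which cancels out unequal starting abundance. The following is a minimal sketch under our own assumptions: the log2 scale and the pseudocount (added so that shRNAs absent from one bin do not produce a division by zero) are illustrative choices, and the counts are invented.

```python
import math


def log2_enrichment(low_counts, high_counts, pseudocount=1.0):
    """log2 ratio of each shRNA's read fraction in the low bin vs the high bin.

    low_counts, high_counts: dicts mapping shRNA id -> read count.
    pseudocount: assumed smoothing constant, not taken from the source.
    """
    ids = set(low_counts) | set(high_counts)
    total_low = sum(low_counts.values()) + pseudocount * len(ids)
    total_high = sum(high_counts.values()) + pseudocount * len(ids)
    scores = {}
    for shrna_id in ids:
        frac_low = (low_counts.get(shrna_id, 0) + pseudocount) / total_low
        frac_high = (high_counts.get(shrna_id, 0) + pseudocount) / total_high
        scores[shrna_id] = math.log2(frac_low / frac_high)
    return scores


# Hypothetical read counts: sh1 behaves like an active shRNA,
# sh2 and sh3 like inactive ones.
low_bin = {"sh1": 400, "sh2": 50, "sh3": 50}
high_bin = {"sh1": 100, "sh2": 200, "sh3": 200}

scores = log2_enrichment(low_bin, high_bin)
```

A positive score means the shRNA is over-represented in the low-expression bin relative to the high-expression bin, regardless of how abundant it was in the starting library.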
The ability to identify multiple active shRNAs specific for each gene is one of the most critical improvements of our approach over existing methodologies, and is a direct consequence of including ~30 shRNAs per gene in the library (33 for CD45). Taking into account the full range of shRNAs for each gene, a rigorous statistical test can be performed to differentiate true hits from genes that by chance have one or two off-target shRNAs. This allows the assignment of a P value to every screened gene (rather than to single shRNAs). Using this method, we could readily resolve anti-CD45 shRNAs from the rest of the library (P < 2 × 10⁻⁷; see Supplementary Note 2). In contrast, the large majority of shRNAs were not significantly enriched in any fraction. This result was not unique to a particular CD antigen or cell type, as we obtained similar results when sorting for different CD antigens (LAIR1/CD305 in U937 cells or CD3 in Jurkat cells). LAIR1- and CD3-specific shRNAs could be clearly resolved from the rest of the library (P = 2.6 × 10⁻⁵ and P = 1.1 × 10⁻⁷, respectively) (Supplementary Fig. 4).
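The test actually used is described in Supplementary Note 2 and is not reproduced here. To illustrate the general idea of scoring a gene from all of its shRNAs at once, the sketch below substitutes a one-sided Mann-Whitney rank-sum test (normal approximation, average ranks for ties, no tie correction of the variance), comparing the enrichment scores of one gene's shRNAs with those of the rest of the library. The choice of test and all scores are assumptions made for illustration.

```python
import math


def rank_sum_p_value(gene_scores, background_scores):
    """One-sided Mann-Whitney U test via the normal approximation.

    Returns the approximate probability of seeing gene_scores ranked this
    high (or higher) relative to background_scores under the null.
    """
    n1, n2 = len(gene_scores), len(background_scores)
    combined = sorted(
        [(s, 0) for s in gene_scores] + [(s, 1) for s in background_scores]
    )
    # Assign 1-based ranks, averaging over runs of tied scores.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


# Hypothetical log2 enrichment scores: most shRNAs for the gene are enriched,
# one is inactive; the background is centered on zero.
gene = [2.1, 1.8, 2.5, 0.1, 1.9, 2.2]
background = [0.0, -0.2, 0.3, 0.1, -0.1, 0.2, -0.3, 0.05, 0.15, -0.05]

p_gene = rank_sum_p_value(gene, background)
p_null = rank_sum_p_value([1, 2, 3], [1, 2, 3])
```

Because the score is computed over all shRNAs for a gene, a single strongly enriched off-target shRNA has little effect on the rank sum, whereas several moderately enriched shRNAs for the same gene yield a small P value.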
To further test the ability of our method to enrich for active target-specific shRNAs, we individually cloned and analyzed the potency of 33 anti-CD45 shRNAs predicted by an algorithm [15] (Supplementary Fig. 5). Only about 50% of these had even minimal activity (>25% knockdown), and only 6 had >60% knockdown. In general, while we could see substantial enrichment for active shRNAs after a single sort, there was little enrichment for shRNAs with low activity. Nonetheless, the correlation between activity and enrichment was not perfect, possibly because of off-target effects of some shRNAs.
Interestingly, the highly active shRNAs were not restricted to those predicted to be most active by the algorithm; indeed, the most active species were often quite low on the list. We observed similar results when testing shRNAs directed against LAIR1 (Supplementary Fig. 6). This analysis illustrates a key advantage of our expanded-coverage library: without including ~30 shRNAs per gene it would have been impossible to predict enough functional shRNAs to corroborate hits. As data accumulate on which hairpins are most active, shRNA prediction algorithms can be improved and library sizes reduced.
In summary, our approach provides an efficient method for rapidly creating and screening shRNA libraries, which addresses both the false negative and false positive problems that commonly plague RNAi screens. With only ~20–30% of predicted hairpins giving >50% knockdown (Supplementary Figs. 5–6), low-complexity libraries will often not have enough shRNAs to corroborate genuine hits in a screen. We show here that high-coverage shRNA libraries can identify many shRNAs targeting a single gene, which increases confidence in hits obtained in RNAi screens. The increased complexity of these libraries can be deconvoluted through deep sequencing. We further show that this method allows detection of active shRNA hits in a model screen without extensive selection, sorting, and cell proliferation, which will greatly facilitate efforts to identify essential genes whose absence may slow growth or cause cell death. Finally, the direct clone-and-use method provides the flexibility to easily remake libraries to immediately incorporate continual advances in RNAi technology, such as various microRNA contexts for expression and improved algorithms for shRNA prediction.