We designed the multipurpose, cost-effective GMAP to enable researchers to inexpensively and comprehensively collect data from genome-scale pooled gene-dosage modulation screens performed in human, mouse, and yeast cells using commercially available gene modulation libraries on a standard platform. Specifically, the GMAP enables readout of clone enrichments and depletions from pooled screens using The RNAi Consortium (TRC) human and mouse libraries [11], human ORF expression pools [23], and pooled screens using gene deletion-associated barcodes or ORFs from budding yeast [1
The detection features on GMAP are summarized in Table . The GMAP features encompass the entire TRC1 shRNA human and mouse collections and a portion of the greatly expanded TRC2 collection, such that >248,000 unique shRNAs from the TRC collection can potentially be assayed in parallel. To achieve this density, GMAP features were synthesized at a 5 μm scale, giving each feature a 4.84-fold smaller surface area than the 11 μm features on the TRCBC array, a commercially unavailable microarray that was previously generated to perform pooled shRNA screens on a subset of the TRC1 shRNA collection [14]. We based the design of the GMAP shRNA features on observations from the TRCBC microarray, which included three different but highly overlapping sequences to detect each shRNA [14]. One of these three feature designs was superior to the others in tests using engineered pools with known relative shRNA compositions (Additional file 1, Figure S1). The GMAP chip starts with the best-performing feature design and adds one base from the 21-base shRNA stem sequence, extending the feature length from 21 to 22 bases. Three identical replicates per shRNA barcode feature were included on the GMAP microarray. This strategy was used to maximize signal-to-noise and to identify and potentially correct for any subtle shRNA processing inconsistencies without significantly impacting specificity.
Description of the features on the UT GMAP 1.0
To achieve consistent results, a simple-to-follow standard operating procedure (SOP) was developed for preparing samples for hybridization (see Additional file 2 for detailed protocols and recipes). As part of the SOP optimization process, the amount of shRNA probe hybridized to the GMAP array was standardized. Hybridizations were performed with 1 μg, 2 μg, and 3 μg of purified shRNA probe (see Materials and Methods) generated from a plasmid pool of 78,432 (78 k) human shRNAs (Additional file 1, Figure S2a). The 2 μg quantity gave the best signal-to-noise ratio and was therefore chosen as the standard for all 78 k pool hybridizations. For pools of different complexity, the amount of probe applied to GMAP is scaled proportionally relative to the 2 μg used for 78 k pools (for example, ~1.35 μg for 54 k pools). In addition to optimizing the quantity of probe, a range of hybridization and washing temperatures was tested using 2 μg of 78 k shRNA pool probe. Hybridization was tested at 40°C and 45°C, with washing tested at 30°C and 35°C. Hybridization at 40°C with array washing at 30°C provided the highest probe signal with minimal signal from features corresponding to shRNAs not included in the pool (Additional file 1, Figure S2b).
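The proportional scaling of probe quantity can be illustrated with a short calculation. This is a minimal sketch, assuming strictly linear scaling from the 2 μg used for the 78,432-shRNA pool; the function name `probe_mass_ug` and the exact pool size of 54,000 are illustrative assumptions, not values taken from the SOP.

```python
def probe_mass_ug(pool_size, ref_pool_size=78_432, ref_mass_ug=2.0):
    """Scale probe mass linearly with pool complexity (assumed rule).

    ref_pool_size and ref_mass_ug reflect the 78 k pool standard quoted
    in the text; pool_size is the number of shRNAs in the new pool.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return ref_mass_ug * pool_size / ref_pool_size


# Hypothetical pool of exactly 54,000 shRNAs.
print(f"{probe_mass_ug(54_000):.2f} ug")
```

Under this assumption a pool of exactly 54,000 shRNAs works out to roughly 1.38 μg, in the same range as the ~1.35 μg quoted for 54 k pools; the small difference presumably reflects rounding or the exact pool size.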
To assess GMAP chip performance with complex shRNA populations, probe generated from the 78 k human shRNA pool or a different 78 k mouse shRNA pool was hybridized and analyzed. Signal intensity histograms and cumulative distribution plots generated from the collected data indicate that the signal intensities for probes corresponding to shRNAs present in the 78 k pools were well resolved from the background signal of GMAP features corresponding to shRNAs not present in the pools (Additional file 1, Figure S3a,b). The small overlap between the signals for features matched to either of the human or mouse shRNA pools versus the features that have no matching shRNA in the corresponding pools indicates low rates of library composition errors and cross-hybridization. A surface plot of intensities for all shRNA features on the GMAP array for the human 78 k pool versus the mouse 78 k pool indicates that large pools of equal complexity can be easily resolved from each other (Spearman correlation R < 0.01) and from shRNA features not corresponding to either experimental pool (Figure ).
Figure 1 Readout of high complexity shRNA pools. (a) Surface plot of signal intensities for all 248,049 shRNA features on the GMAP microarray following hybridization of probe generated from the human 78 k plasmid shRNA pool or the mouse 78 k plasmid shRNA pool.
One major application of pooled shRNA screens is to perform negative selection screens, so-called "dropout screens", in cancer cell lines to identify cancer-essential genes that represent potential therapeutic targets. To assess the performance of the GMAP in deconvolving genome-scale shRNA dropout screens, we first calculated a minimum signal in the reference data for inclusion in analysis. This was accomplished by adding 1.96 standard deviations (the 95% confidence limit) to the mean background log signal measured on human 78 k pool hybridizations, yielding a threshold of log2 = 7.89 (Additional file 1, Figure S3c). With this cutoff, 89.9% of the shRNA signals in the reference data exceeded the threshold and were retained for analysis. We simulated shRNA dropout screens by altering the concentration of sub-fractions of the human 78 k pool in two different shRNA dilution experiments. The first approach utilized four different 78 k human pools, each containing the same 78,432 shRNAs. Approximately 70,100 shRNAs were present in equal concentration in all four pools, while an identical sub-fraction of ~8,300 shRNAs was either undiluted (the reference "Even pool"), or diluted 4-fold, 16-fold, or 64-fold (the "4x", "16x" and "64x" pools, respectively). Distribution plots of shRNA log2 signal intensity difference between the diluted pools and the Even pool demonstrate that the diluted subpools can be distinctly resolved from each other (Figure ). In a second approach, we constructed a single 78 k shRNA pool (the "2x-20x" pool) in which, relative to the bulk population, four distinct subsets of ~8,400 shRNAs each were diluted 2-fold, 5-fold, 10-fold and 20-fold, respectively. Similar to the previous experiment, distribution plots of shRNA signal difference between diluted pools and the reference population in the Even pool demonstrate that the diluted subpools are resolvable (Figure ). We did observe that the signal reduction in diluted subpools on the GMAP chip is not directly proportional to the dilution factor, indicating that changes in microarray signal are not linearly proportional to shRNA concentration across the full observed signal range. In particular, the subpool dilution signals are spaced more closely, or compressed, relative to what would be expected for a linear concentration versus signal relationship.
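The inclusion threshold described above (mean background log2 signal plus 1.96 standard deviations) can be sketched as follows. This is an illustrative example on made-up numbers, not the analysis code used in the study; the function names are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
from statistics import mean, stdev


def background_threshold(background_log2, z=1.96):
    """Mean background log2 signal plus z sample standard deviations."""
    return mean(background_log2) + z * stdev(background_log2)


def retained_fraction(reference_log2, threshold):
    """Fraction of reference shRNA signals that exceed the threshold."""
    kept = [s for s in reference_log2 if s > threshold]
    return len(kept) / len(reference_log2)


# Toy values only: log2 signals of features with no matching shRNA in the pool.
background = [6.0, 6.5, 7.0, 6.5, 6.0, 7.0]
# Toy values only: log2 signals of features whose shRNA is in the pool.
reference = [7.0, 8.0, 9.5, 10.2, 7.3]

cutoff = background_threshold(background)
print(f"threshold (log2): {cutoff:.2f}")
print(f"retained: {retained_fraction(reference, cutoff):.0%}")
```

Applied to the real 78 k hybridization data, the same calculation is what yields the log2 = 7.89 cutoff and the 89.9% retention reported in the text.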
To assess the performance of GMAP relative to the lower-density, previous-generation TRCBC array, we amplified shRNA probes from genomic DNA extracted from human BT-474 cells infected with a ~54,000 shRNA pool of lentiviruses corresponding to all available human shRNA features on the TRCBC chip, then divided the probe and hybridized samples to both TRCBC and GMAP chips. Comparison of signal intensity between the two array formats revealed a Spearman correlation coefficient of R = 0.96 (Additional file 1, Figure S4). This result confirms that moving from 11 μm to 5 μm feature sizes does not cause any significant reduction in signal strength or specificity, in agreement with reported observations on yeast barcode arrays [26
To compare the BAR-seq strategy [20] with microarray detection for shRNA barcodes, we used Illumina next-generation sequencing technology to enumerate barcodes in the same five human shRNA 78 k pools described above (Even, 4x, 16x, 64x and 2x-20x). Sequencing libraries were prepared from these pools via essentially the same approach used for generating labelled shRNA barcodes for the GMAP array, followed by an additional amplification step to incorporate adapter regions (see Materials and Methods). Between 22.3 million and 25.6 million mapped sequence reads were obtained for each of the five shRNA plasmid pools, yielding a median number of reads per shRNA per 78 k pool of between 107 and 162 (Table ). Notably, 73,073 (93.4%) of expected shRNA sequences were detected in the combined data from all five shRNA pools sequenced (>121 million mapped reads) and 68,420 (87.5%) of expected shRNA barcodes were detected in the human 78 k Even pool alone (Table ). Comparison of the sequence read frequency to GMAP intensity signal for shRNA barcodes in the Even pool resulted in a positive Spearman correlation of R = 0.37 (Figure ). When the comparison was restricted to shRNA clones with at least 16 sequencing reads and a signal intensity of at least log2 = 7.89 on the GMAP array, the correlation improved to 0.42. While both the sequencing and the chip hybridization methods provide a reproducible assessment of relative shRNA concentration, the modest correlation between the signals for these two types of readouts indicates that they have significantly different relationships to absolute shRNA concentration.
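The filtered rank correlation described above can be sketched in a few lines. This is a minimal, dependency-free illustration on invented numbers, assuming one read count and one log2 array signal per shRNA; the function names are hypothetical, and the cutoffs of 16 reads and log2 = 7.89 are taken from the text.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def filtered_spearman(reads, log2_signal, min_reads=16, min_log2=7.89):
    """Correlate only shRNAs passing both the read and signal cutoffs."""
    pairs = [(r, s) for r, s in zip(reads, log2_signal)
             if r >= min_reads and s >= min_log2]
    if len(pairs) < 2:
        raise ValueError("need at least two shRNAs after filtering")
    xs, ys = zip(*pairs)
    return spearman(xs, ys), len(pairs)


# Toy data: read counts and log2 array signals for five shRNAs.
reads = [5, 20, 100, 150, 300]
signal = [9.0, 7.0, 8.0, 9.0, 10.0]
rho, n_kept = filtered_spearman(reads, signal)
print(n_kept, round(rho, 3))
```

Computing the correlation on ranks rather than raw values matters here because read counts and array intensities are on different, non-linearly related scales.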
Summary of shRNA pool deconvolution by next-generation sequencing
Figure 2 Comparison of sequencing and GMAP performance with high complexity shRNA pools. (a) Scatter plot of GMAP array signal intensity (X-axis) versus sequencing read number (Y-axis) for shRNA clones from the human Even shRNA pool. (b) Distribution plots of ...
A major concern in using large shRNA pools with microarray detection strategies is the extent to which cross-hybridization occurs. To address this issue, we measured the frequency of significant signals from GMAP shRNA features that did not correspond to constituents of a pool of shRNA probes. Specifically, we examined features corresponding to the 78 k mouse pool for signal when the 78 k human pool was hybridized. Features on GMAP that had 100% identity to shRNAs in both the mouse and human pools were removed from the analysis. After doing so, we found that only 2,637 of 77,690 (3.39%) mouse 78 k pool features had significant signal (log2 ≥ 7.89). From this finding, we infer that the cross-hybridization rate among the human shRNA features from human 78 k pooled probe would be similar.
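The cross-hybridization estimate above amounts to counting off-target features above the signal cutoff after excluding sequences shared between pools. The sketch below is illustrative only: the function name, the dictionary layout and the feature identifiers are assumptions, while the log2 cutoff of 7.89 and the reported counts come from the text.

```python
def cross_hybridization_rate(off_target_log2, shared_ids, threshold=7.89):
    """Count off-target features with signal at or above the threshold.

    off_target_log2: feature id -> log2 signal for features whose shRNA
                     is absent from the hybridized pool (assumed layout).
    shared_ids: feature ids with 100% identity to an shRNA in the
                hybridized pool, which are excluded from the count.
    """
    eligible = {fid: s for fid, s in off_target_log2.items()
                if fid not in shared_ids}
    positives = sum(1 for s in eligible.values() if s >= threshold)
    return positives, len(eligible), positives / len(eligible)


# Toy example with hypothetical mouse feature ids.
signals = {"m1": 6.2, "m2": 8.4, "m3": 7.1, "m4": 9.9, "m5": 6.8}
hits, total, rate = cross_hybridization_rate(signals, shared_ids={"m4"})
print(hits, total, f"{rate:.1%}")

# Reported counts from the text: 2,637 of 77,690 features.
print(f"{2637 / 77690:.2%}")
```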
Frequency distribution plots of sequencing read counts for the Even, 4x, 16x, 64x and 2x-20x pools (Figure ) show similar characteristics to the distributions generated for the same pools by microarray detection on the GMAP (Figure ), with some exceptions. First, the distributions tended to be slightly wider for sequence data. Second, the distributions for sequence data exhibit a linear relationship that more accurately reflects the actual experimental dilution of shRNAs in the dilution pools. In other words, the distribution curves for sequencing data tend to center over the correct fold-dilution. These observations demonstrate that sequencing barcode pools results in linear, quantifiable signals whereas microarray detection displays nonlinear signal, a behaviour previously observed by Pierce et al. In addition, the dynamic range of signal obtained from GMAP is compressed relative to deep sequencing (Additional file 1, Figure S5). These differences aside, GMAP detection of shRNA sequences was similar to sequencing-based deconvolution in its sensitivity and quantitative reproducibility over a range of shRNA concentrations.
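The notion of signal compression can be made concrete by comparing an observed log2 shift with the shift expected for an ideal linear readout, which for an n-fold dilution is -log2(n). The following is a minimal sketch with invented observations; the function names and the "compression ratio" summary are illustrative assumptions rather than a metric used in the study.

```python
from math import log2
from statistics import median


def expected_log2_shift(fold_dilution):
    """Ideal log2 signal difference for a linear readout."""
    return -log2(fold_dilution)


def compression_ratio(observed_log2_diffs, fold_dilution):
    """Median observed shift divided by the ideal shift.

    A value near 1 indicates a linear response; a value below 1
    indicates a compressed response.
    """
    return median(observed_log2_diffs) / expected_log2_shift(fold_dilution)


for fold in (4, 16, 64):
    print(fold, expected_log2_shift(fold))

# Hypothetical array log2 differences for a 16-fold diluted subpool.
print(compression_ratio([-2.1, -1.9, -2.0], 16))
```

With these made-up values the diluted subpool shifts by only half of the ideal 4 log2 units, which is the kind of pattern the text describes for array data, whereas sequencing-derived distributions would be expected to give a ratio closer to 1.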
Microarray experiments have historically suffered from subtle to substantial differences in data produced from the same or similar templates when performed on different dates, by different users, or in different locations [28]. A candidate method to enable detection and/or correction of these differences is to include a standard set of synthetic oligonucleotide probes that are "spiked" into array hybridization mixtures in defined amounts. These hairpin spike-in (HPSI) probes, designed to have identical length and similar sequence characteristics to shRNAs, may be used as yardsticks for artifact detection and data normalization methods. We replicated 25 clusters of 200 HPSI features across the GMAP (Table ). Spatially localized signal intensity fluctuations between HPSI clusters may indicate poor hybridization or washing performance, contamination, or physical damage to the array surface. As a trial, 12 HPSI probes were tested in replicate array hybridizations to examine their hybridization characteristics; they behaved in a dose-dependent manner similar to labelled shRNA probes (Additional file 1, Figure S6).
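One way the replicated spike-in clusters could be used for artifact detection is to compare each cluster's median signal with the array-wide median across clusters. The sketch below is a hypothetical illustration of that idea, not a procedure described in the paper; the function name, the cluster identifiers and the 1.0 log2-unit tolerance are all assumptions.

```python
from statistics import median


def flag_hpsi_clusters(cluster_signals, max_log2_deviation=1.0):
    """Return ids of clusters whose median deviates from the array median.

    cluster_signals: cluster id -> list of log2 spike-in signals measured
                     in that cluster (assumed layout).
    max_log2_deviation: tolerance in log2 units (hypothetical default).
    """
    cluster_medians = {cid: median(vals)
                       for cid, vals in cluster_signals.items()}
    array_median = median(cluster_medians.values())
    return sorted(cid for cid, m in cluster_medians.items()
                  if abs(m - array_median) > max_log2_deviation)


# Toy data: three clusters, one with locally depressed signal.
clusters = {
    "c01": [10.0, 10.2, 9.9],
    "c02": [10.1, 10.0, 10.3],
    "c03": [7.5, 7.8, 7.6],
}
print(flag_hpsi_clusters(clusters))
```

Medians are used here so that a few damaged features within a cluster do not dominate the comparison; a real implementation would also need to account for the physical coordinates of each cluster.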
To examine GMAP performance in a simulated cell screening experiment, two large pools of 90,408 shRNAs, each targeting ~18,000 human genes, were constructed. For the "90 k Reference" pool, all of the shRNA plasmids were combined at approximately equal concentrations, while a dilution series pool, the "90 k Dilution", contained four subsets of ~3,500 shRNAs each that were diluted 2-fold, 4-fold, 10-fold or 20-fold with respect to their counterparts in the 90 k Reference pool. Distribution plots of shRNA log2 signal difference between the 90 k Dilution and the 90 k Reference pools demonstrate that the diluted subfractions are clearly resolvable (Figure ). We generated lentivirus from the 90 k Reference and 90 k Dilution pools and used it to infect A549 cells. Genomic DNA prepared from these cells, containing integrated shRNA-expressing cassettes, was used as template for probe generation and GMAP hybridization. The resulting distribution plots of log2 signal difference between the 90 k Dilution and 90 k Reference pools post-infection (Figure ) were very similar to those achieved with probe generated from plasmid template for the 90 k pools and 78 k pools. A scatter plot comparing data for the 90 k Reference pool derived from plasmid and genomic DNA template reveals excellent correlation (R > 0.97, see Additional file 1, Figure S7), further demonstrating that reliable results can be obtained from pooled cell-based screening experiments with shRNA pools spanning essentially the entire human genome with coverage of 4-5 shRNAs per gene.
To enable pooled ORF over-expression screening using the GMAP array to detect and quantify ORF sequences, we designed features against 22,449 distinct human ORF sequences in the Mammalian Gene Collection (MGC) (Table and [29]). Between 1 and 8 probes were designed against each ORF, with a median of 7 probes per sequence. For comparative purposes, we also included up to three additional features found on the human expression profiling Affymetrix Gene 1.0 ST array for 18,088 of these sequences. To assess the GMAP performance with human ORF hybridization, we developed 41 plasmid pools of entry clones (15,347 open reading frames) from the human ORFeome v5.1 collection that were combined to generate a single master pool. Subsequently, the 15,347 ORFs were amplified in pooled format with common flanking primers, labelled and hybridized to both the Human Gene 1.0 ST array and GMAP arrays in duplicate (see Materials and Methods). Signal for features shared between the two arrays was highly correlated (Pearson's correlation coefficient R = 0.953, Figure ), with similar distribution of signal across the features for each array (Figure ) and similar signal-to-noise ratios (data not shown). These results demonstrate that the GMAP has robust reporting of ORF data, and suggest that the GMAP can be used for a number of human gene assays including human ORF/cDNA overexpression screens.
Figure 3 Quality of human ORF features on GMAP chip. (a) Intensity signals from the GMAP array (X-axis) plotted against signals from the Human Gene 1.0 ST array (Y-axis) following amplification of human ORFeome v5.1 pools and hybridization of the resulting probe ...
To compare the dynamic range of signal for huORF and huGene features on the GMAP, five different quantities of probe generated from ORFeome plasmid pools were hybridized to GMAP in duplicate. The resulting data indicated that two-fold changes in probe input produced highly correlated signals across a 16-fold dynamic range (Figure ); we therefore combined huORF and huGene features into one feature set. Our methodology to detect ORF sequences on microarrays depends on common flanking primers that amplify entire open reading frames. This provides an opportunity to examine the utility of the three sets of ORF features on the GMAP microarray, including the 22-mer hairpin features designed to detect shRNA barcodes. Signals from sequence identity-conserved huORF features and shRNA features on GMAP were concordant, and the two feature types were effectively interchangeable, across the entire range of probe concentrations (Figure ). This concordance can be further exploited to expand the feature set for each gene to obtain more accurate measurements of ORFs in a given hybridization mix, which is particularly important for discriminating isoforms in a polyclonal library of open reading frames.
GMAP also contains triplicate copies of features for the collection of ~16,000 20-nucleotide Yeast Knockout Collection barcodes [2], and single copies of ~12,000 25-nucleotide yeast ORF-specific features (Table ). The yeast ORFs represented on this array are identical to those found on the TAG4 barcode array, with two distinct probes designed against each of 5,718 yeast ORF sequences. These features can serve as additional controls as they are non-homologous to the shRNA or human ORFeome probes on the GMAP. The yeast features on the GMAP display comparable performance to the TAG4 microarray (Affymetrix) from which they are derived (data not shown).
A major hurdle to using a new microarray platform is the informatics associated with data extraction and analysis. The GMAP chip contains several subsets of features, some of which can be organized as traditional Affymetrix feature sets, as well as features for interrogation of short input DNA sequences (shRNA barcodes). Extraction of signal from these arrays can be done with the Affymetrix Power Tools (APT) collection. To aid adoption of the GMAP, we have developed a Java application to allow data reorganization, graphical summaries and downstream analysis of the datasets extracted from these arrays (files and applications available online at http://chemogenomics.med.utoronto.ca/supplemental/gmap/). This application makes use of a variety of R and Bioconductor libraries (see Materials and Methods). The R functions that we have developed for this analysis will also be available as scripts/libraries to allow end users to work directly in the R environment. The current functionality of the application includes methods to extract signals into a single tab-delimited file for subsets of features and GC-matched background signal, normalization procedures, routines to generate signal ratios against one or more reference chips with options to fine-tune analyses, and standard or user-defined annotation files for merging analyzed signal.
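The idea of generating signal ratios against one or more reference chips can be illustrated with a small, generic sketch. This is not the GMAP Java application or its R functions, whose interfaces are not described here; the function name and data layout are hypothetical, and averaging the reference chips in log2 space is an assumption.

```python
from statistics import mean


def log2_ratios(sample_log2, reference_chips_log2):
    """Per-feature log2 ratio of a sample against averaged reference chips.

    sample_log2: feature id -> log2 signal on the sample chip.
    reference_chips_log2: list of dicts with the same layout, one per
                          reference chip (assumed layout).
    Features missing from any reference chip are skipped.
    """
    ratios = {}
    for fid, signal in sample_log2.items():
        if all(fid in chip for chip in reference_chips_log2):
            ref = mean(chip[fid] for chip in reference_chips_log2)
            ratios[fid] = signal - ref
    return ratios


# Toy data: two shRNA features, two reference chips.
sample = {"sh1": 9.0, "sh2": 8.0, "sh3": 7.0}
references = [
    {"sh1": 10.0, "sh2": 8.5},
    {"sh1": 11.0, "sh2": 7.5},
]
print(log2_ratios(sample, references))
```

Because the signals are already on a log2 scale, the ratio is a simple difference; a negative value indicates depletion of that shRNA relative to the reference.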