The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact firstname.lastname@example.org
Saccharomyces cerevisiae knockout collection TAG microarrays are an emergent platform for rapid, genome-wide functional characterization of yeast genes. TAG arrays report abundance of unique oligonucleotide ‘TAG’ sequences incorporated into each deletion mutation of the yeast knockout collection, allowing measurement of relative strain representation across experimental conditions for all knockout mutants simultaneously. One application of TAG arrays is to perform genome-wide synthetic lethality screens, known as synthetic lethality analyzed by microarray (SLAM). We designed a fully defined spike-in pool to resemble typical SLAM experiments and performed TAG microarray hybridizations. We describe a method for analyzing two-color array data to efficiently measure the differential knockout strain representation across two experimental conditions, and use the spike-in pool to show that the sensitivity and specificity of this method exceed typical current approaches.
Introduction of the yeast knockout collections, containing arrayed strains harboring deletion mutations for >95% of predicted open reading frames (ORF), allows systematic genome-wide screens for various phenotypes to be readily accomplished (1–3). One aspect of the knockout collections that facilitates rapid screens is the pair of unique 20mer TAGs within each deletion mutation (4). A gene in each yeast knockout strain (YKO) is replaced with a selectable marker flanked by two TAGs, termed UPTAG and DNTAG. All UPTAGs or all DNTAGs in a sample can be amplified in a PCR using universal primers. Individual YKO representation is subsequently interrogated by hybridizing labeled TAGs to microarrays and observing changes in signal intensity between experimental conditions. This approach has been applied to genome-wide screens for mutant phenotypes (3,5), synthetic genetic interactions (6–10) and synthetic-chemical-genetic interactions (10). TAG microarray approaches are rapid and comprehensive, but systematic optimization of analysis methods is lacking. We address this need here.
The most common application of TAG arrays is comparison of YKO representation in two samples. Typically, samples are co-hybridized on one array using complementary fluorescent labels (Cy5 and Cy3). Various general approaches for two-color arrays have been proposed for quality assessment, background adjustment (11,12) and normalization (13–15). However, in analysis of TAG arrays, each YKO has UPTAG and DNTAG probes; four measurements corresponding to each YKO are obtained. Finding the best way to summarize this information in one quantity reflecting differential YKO representation is not trivial (16). Here we demonstrate the utility of a spike-in experiment by evaluating a simple quality assessment procedure and a novel strategy for combining the UPTAG and DNTAG information in a way that is robust to problematic TAGs. Our results show that implementing these two data procedures can greatly improve the utility of TAG arrays.
Preparation of spike-in pools. Heterozygous YKOs (Research Genetics) were grown on solid media and combined, with the exception of YKOs from plate 259, which were separately mixed in subpools, as shown in Figure 1, before incorporation into pool A or B at appropriate representation levels. Genomic DNA was extracted from samples of each spike-in pool using the Masterpure Yeast DNA kit (Epicentre). Pool A and B TAGs were labeled with Cy3 and Cy5, respectively, and TAG microarrays were hybridized, washed and scanned as described (17).
Analyses were performed using custom scripts written in R, an open-source statistical language (18). GenePix local background intensities were not used for correction because, as suggested by Yang et al. (12), subtracting these severely increases noise (data not shown). Normalization was performed using a procedure similar to the one previously proposed (14). Alternate normalization methods did not impact results (data not shown).
The GEO accession number for microarray data is GSE2832. Data and code necessary to reproduce all the results and figures are available upon request.
To evaluate statistical procedures for TAG microarray data, we tailored defined spike-in pools to resemble expected results in a typical synthetic lethality analyzed by microarray (SLAM) experiment (6). Synthetic lethality is defined as inviability of cells containing two mutations which are individually not required for growth. In a SLAM experiment, viable YKOs in pooled form are compared under two conditions: absence versus presence of a specific second mutation (the ‘query’). The average number of genetic interactions expected in a genome-wide screen has been estimated to be ~35, although several query mutations with interactions exceeding 100 have been analyzed (10,19). Therefore, we designed a pair of pools (‘A’ and ‘B’) with 5758 YKOs at equivalent representation, and a set of 94 YKOs with known differential representation ranging from 1:2^(1/3) (about 1:1.26) to 1:25 and 1:infinity (Figure 1 and Supplementary Table 1). Additionally, certain YKOs grow slowly, and representation of these in the control SLAM sample is expected to be lower than YKOs with wild-type growth rate. To examine our ability to address these mutants, we designed three representation levels in the control (B) pool: high (about equal to all other strains), medium (8-fold dilute) and low (64-fold dilute). TAGs from pools A and B were amplified with Cy3- and Cy5-labeled primers, respectively. These samples were mixed at equal ratio, such that most TAGs should exhibit equal hybridization, while Cy5:Cy3 ratios that deviate from one are expected for the few differentially represented TAGs. This design allows discovery of the best method to produce a measure of differential representation from hybridization results.
Before addressing differential representation, we document the utility of two filtering steps in data pre-processing. First, we noted TAG-specific hybridization artifacts evident in self-self hybridizations performed to examine the noise distribution. Pool A DNA served as template for preparation of labeled TAGs with both Cy5 and Cy3. Thus, all TAGs were present at equal amounts between channels. Figure 2a shows a scatter-plot of normalized log2 Cy5:Cy3 ratios for corresponding UPTAG and DNTAG probes. Because all these values should be zero plus measurement error, we expect these to be uncorrelated and centered at zero. Figure 2a confirms this except for a few YKOs with extreme values for one TAG. These outliers may have a negative impact on specificity. They are likely to be due to individual TAG templates that enter the labeling PCR as contaminants, which are detectable even at very low levels (17).
We determined that these artifacts are consistent across experiments performed with a single batch of labeled primer, but not between different primer batches (Supplementary Figure 1). To create a useful filter, we assumed that the data follow a bivariate normal distribution and defined outliers as TAGs with log ratios three SDs away from zero, using a robust estimate of the SD. If the log ratio data follow a normal distribution, excluding outliers, we expect to inappropriately remove only ~0.5% of the data (32 TAGs). We applied this filter (Figure 2a, red lines) independently to UP- and DNTAGs. Fortunately, the YKOs were designed with two TAGs per gene (except for 192 strains lacking DNTAGs), greatly improving chances that at least one TAG performs adequately. Because non-outlier UP/DNTAGs appear to provide independent measurements, the chance of inappropriately removing both TAGs for the same YKO is less than 0.00001. Using this procedure we defined 193 DNTAGs (purple circles) and 244 UPTAGs (blue circles) as primer-batch specific outliers. Six YKOs had both UP and DN ratios filtered (orange circles).
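The filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' R code: it assumes the robust SD is the scaled median absolute deviation (the text does not say which robust estimator was used), and the function names and example log ratios are hypothetical.

```python
import statistics

# Assumption: 1.4826 rescales the median absolute deviation (MAD) so that it
# estimates the SD of normally distributed data.
MAD_TO_SD = 1.4826


def robust_sd(values):
    """Robust SD estimate: scaled median absolute deviation about the median."""
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    return MAD_TO_SD * statistics.median(deviations)


def flag_outliers(log_ratios, n_sd=3.0):
    """Return one boolean per TAG: True if |log ratio| is more than n_sd robust SDs from zero."""
    cutoff = n_sd * robust_sd(log_ratios)
    return [abs(r) > cutoff for r in log_ratios]


# Hypothetical self-self log2 Cy5:Cy3 ratios for eight UPTAGs; one is an artifact.
up_log_ratios = [0.05, -0.10, 0.02, 0.08, -0.04, 2.60, -0.07, 0.01]
up_outlier = flag_outliers(up_log_ratios)
print(up_outlier)
```

As in the text, the same function would be applied separately to the UPTAG and DNTAG log ratios from the self-self hybridization.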
Next, we considered the effect of TAG-specific hybridization behavior resulting from the presence of nucleotide mutations found in some of the TAGs and universal priming sites (20). This is important because sensitivity will be markedly affected when TAGs fail to provide a meaningful measure of strain representation. Histograms of log2 signal intensity display a bimodal distribution (Figure 2b and data not shown) for UP- and DNTAGs whether Cy3 or Cy5 labeling is used. The lower peak is close to background intensities and contains non-functional TAGs with absent or inefficient hybridization. While TAG sequence discrepancies have been characterized (20), knowledge of the presence and nature of mutations was insufficient to fully predict hybridization behavior (17,20).
The naïve approach to summarizing UP- and DNTAG information is to average their observed log ratios. This solution will yield suboptimal measures when one of the TAGs is non-functional. We propose a procedure exploiting the bimodal distribution of TAG intensities to improve on simple averaging. To determine if a TAG is non-functional we fit a mixture model, as in Irizarry et al. (16), to the log intensity data for the control sample. The model fits two normal distributions to the Cy5 data, one for the lower mode and one for the upper mode. The ‘blank’ (YQL) features (17) define the location and width (mean and SD) of the lower distribution. With this fitted model in place, we can predict the probability that each TAG is ‘present’ (Figure 2b). We consider a DN/UPTAG non-functional when it is predicted absent while the complementary UP/DNTAG is present. We define a weighted average = w × UP + (1 − w) × DN, where w = 0.5 + [P(UP present) − P(DN present)]/2. Thus, when UP is present [P(UP present) = 1] and DN is absent [P(DN present) = 0], w = 1 and only UPTAG is used (Figure 2c). We describe a less complex procedure in Supplementary Note 1 that uses binary absent or present values (P = 0 or 1) and performs similarly (data not shown). Researchers using unsophisticated analysis software such as spreadsheet applications may prefer the simple procedure.
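The weighting rule can be written out directly. The sketch below is illustrative only: it takes the two normal components as already fitted, and the component parameters, mixing proportion and function names are assumptions made for the example rather than values from this study. The weight formula itself is the one given in the text.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_present(log_intensity, absent, present, frac_present):
    """Posterior probability that a TAG belongs to the upper ('present') mode.

    absent and present are (mean, sd) pairs for the lower and upper normal
    components; frac_present is the mixing proportion of the upper component.
    """
    p_hi = frac_present * normal_pdf(log_intensity, *present)
    p_lo = (1.0 - frac_present) * normal_pdf(log_intensity, *absent)
    return p_hi / (p_hi + p_lo)


def weighted_log_ratio(up_ratio, dn_ratio, p_up, p_dn):
    """Weighted average w*UP + (1 - w)*DN with w = 0.5 + (p_up - p_dn)/2."""
    w = 0.5 + (p_up - p_dn) / 2.0
    return w * up_ratio + (1.0 - w) * dn_ratio


# Hypothetical fitted components on the log2 intensity scale.
ABSENT = (6.0, 0.5)     # lower mode, e.g. set from blank features
PRESENT = (11.0, 1.0)   # upper mode
FRAC_PRESENT = 0.8

p_up = prob_present(12.0, ABSENT, PRESENT, FRAC_PRESENT)  # bright UPTAG
p_dn = prob_present(6.0, ABSENT, PRESENT, FRAC_PRESENT)   # DNTAG near background
combined = weighted_log_ratio(-2.0, -0.1, p_up, p_dn)
print(round(p_up, 3), round(p_dn, 3), round(combined, 3))
```

When both TAGs are equally likely to be present, w = 0.5 and the result reduces to the simple average. When only one TAG is present, the result approaches that TAG's log ratio.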
We compared the performance of these two strategies and use of UP- or DNTAGs alone with Receiver Operating Characteristic (ROC) curves based on the spike-in experiment. For this analysis, nominal ratios below 2-fold were excluded from the list of True Positives. This choice is appropriate because 2-fold representation difference corresponds to a subtle growth defect at the margin of detection in colony measurement (1.25-fold colony diameter difference is predicted by hemispherical colony volume = 2πr³/3). Supplementary Figures 2–3 present ROC curves with varying stringencies for inclusion as ‘True Positive’, including every spiked-in YKO (1.26-fold or higher). ROC curves in the range of false positives likely to be acceptable (Figure 3a and Supplementary Figure 2) demonstrate that the artifact filtering process has a significant effect on specificity (Supplementary Figure 4 shows the full ROC curves). Additional filtering of non-functional TAGs by the weighted average improves results further (Figure 3b and Supplementary Figure 3).
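For readers unfamiliar with how a ROC curve is built from a spike-in design, the following sketch shows one way to do it. It is not the code used for Figure 3: the scores, labels and function name are hypothetical, and each strain's score is assumed to be oriented so that larger values mean stronger evidence of differential representation.

```python
def roc_points(scores, is_true_positive):
    """Return (false_positive_count, true_positive_rate) pairs.

    Strains are called in order of decreasing score; one point is recorded
    after each additional strain is called. Tied scores are not treated
    specially in this sketch.
    """
    ranked = sorted(zip(scores, is_true_positive),
                    key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, truth in ranked if truth)
    points = [(0, 0.0)]
    tp = 0
    fp = 0
    for _, truth in ranked:
        if truth:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp / n_pos))
    return points


# Hypothetical scores for six strains and their spike-in truth labels.
scores = [3.1, 2.4, 1.9, 0.2, 0.1, -0.3]
truth = [True, True, False, True, False, False]
points = roc_points(scores, truth)
print(points)
```

Plotting the true positive rate against the false positive count, rather than the false positive rate, makes it easy to restrict attention to the small number of false positives that would be acceptable in a screen.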
The effect of two filters, one which removes the systematic artifacts and a second which removes non-hybridizing TAGs, is demonstrated by ratio-intensity plots. A naïve approach to analysis would average UP and DN log ratio to produce a measure M for relative strain representation. By filtering systematic artifacts, noise is significantly reduced (Figure 3c and d, open circles and Supplementary Figure 5). Additionally, combining UP and DN selectively provides increased sensitivity for a number of spiked-in strains (filled shapes).
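The quantities shown in a ratio-intensity plot can be computed as follows. This sketch uses the conventional two-color definitions (log2 ratio on one axis, mean log2 intensity on the other), which are assumed here rather than taken from the text. The function names and intensity values are hypothetical, and normalization is omitted.

```python
import math


def log_ratio_and_intensity(cy5, cy3):
    """Return (log2 Cy5:Cy3 ratio, mean log2 intensity) for one TAG feature."""
    log_ratio = math.log2(cy5 / cy3)
    mean_log_intensity = 0.5 * (math.log2(cy5) + math.log2(cy3))
    return log_ratio, mean_log_intensity


def naive_m(up_cy5, up_cy3, dn_cy5, dn_cy3):
    """Naive measure M: the plain average of the UPTAG and DNTAG log ratios."""
    up_ratio, _ = log_ratio_and_intensity(up_cy5, up_cy3)
    dn_ratio, _ = log_ratio_and_intensity(dn_cy5, dn_cy3)
    return (up_ratio + dn_ratio) / 2.0


# Hypothetical raw intensities: the UPTAG shows a 4-fold difference, the DNTAG none.
up_point = log_ratio_and_intensity(4096.0, 1024.0)
m_naive = naive_m(4096.0, 1024.0, 2048.0, 2048.0)
print(up_point, m_naive)
```

In this example the naive average halves the UPTAG signal because the DNTAG reports no change, which is the situation the weighted average is meant to handle when one TAG is non-functional.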
In summary, we present a spike-in pool design that allows evaluation of various methods for generating measures of differential strain representation. Using this experiment, we determined that the largest factor affecting specificity is the presence of primer batch-specific artifacts, evident in control self-self hybridizations. These artifacts may result from extremely low levels of contaminating TAG sequence template introduced before the labeling PCR. Accidental introduction of contaminants may occur at multiple steps, including the high-performance liquid chromatography (HPLC) column purification of Cy5- and Cy3-labeled primers at their manufacture as well as laboratory manipulation of primer batches during initial stock and aliquot preparation. Because the artifacts are consistent only within batches of primer sets, contamination must occur at initial preparation or during manufacture. Yuan et al. (17) discuss the unusually large dilutions required to prevent contamination in TAG labeling reactions. While the source of these artifacts is uncertain, there are several options for minimizing their effect. The approach we present uses a control hybridization of one DNA sample labeled with both primer sets, such that every TAG is present at equal amounts in the two labeling reactions. Deviations from expected 1:1 ratio can be recognized and filtering is applied.
The methods we describe improve detection of true signal difference between samples; however, they are not perfect. Once primer-batch specific artifacts are removed, noise is increased slightly with the weighted average method compared to averaging (see Supplementary Figure 5d and e). Additionally, the weighted average could cause decreased sensitivity when cross-hybridization occurs for one TAG from a strain at low representation. The TAG that accurately reflects the low representation of the YKO may be discounted while the cross-hybridizing TAG is emphasized. These problems could be minimized by improving the criteria for selecting a TAG as non-hybridizing, perhaps by examining behavior across many microarrays. The advantage of this method is that it requires as few as two microarray hybridizations (self-self and experiment) to perform well. We have tested these methods to provide optimal results from SLAM experiments, where YKOs that decrease in representation from control to experimental samples are sought. However, other TAG microarray experiments should also benefit from the procedures we describe when they are appropriately applied.
Supplementary Data is available at NAR Online.
The authors thank Z. Wu, X. Pan, J. Bader and colleagues from our laboratories for stimulating discussions. This work was supported by NIH HG02432 to J.B., R.I. and F.S. which also funded the Open Access publication charges for this article.
Conflict of interest statement. None declared.