We have described a TAG microarray with elaborate controls and have characterized its performance in terms of FP and FN rates. We showed that most of the FPs can be attributed to cross-hybridization, at least in the experiments described here. We also found that about half of the FNs can be understood in terms of TAG-associated mutations.
Of all the causes of FPs, reagent contamination by traces of amplified TAGs is the most serious. Contamination reveals itself as FPs (or ‘hits’ in a genetic screen) that later fail to be validated. Such results may masquerade as ‘reproducible’ findings for many weeks if the contamination spreads to stock solutions or the general laboratory environment. Detection is best carried out by sham (template-free) PCR controls, provided reagents can be prepared and handled without contamination. An alternative is to carry out surveillance hybridizations with sham PCR products prepared without genomic DNA, or to perform dye-flip experiments as described here, although debugging by such methods is expensive. Fortunately, there was little evidence of contamination by TAGs in the results presented here. A different kind of contamination, e.g. the presence of diploid strains in the haploid strain collection [see Supplementary Figure 3 in (4)], could also account for some of the FPs observed here, but does not explain the ‘wrong-color’ FPs.
Cross-hybridization is a concern shared by all hybridization technologies. Considering the difficulty of designing a large set of oligonucleotide sequences with minimal mutual cross-hybridization, the 3–6% FP rate we found is fairly low and probably more than adequate for many screening applications, especially when confirmatory screening is available. However, for whole-genome profiling applications, the 300 FPs identified with this FP rate will confound interpretation. Even though most FPs are relatively weak, a few are quite strong, and conventional statistical criteria (such as the ‘2 SD’ cutoff for P-values < 0.05) are questionable in this setting. The best approach for reducing the FP rate may be to optimize hybridization and washing parameters, but this would have to be achieved without inflating the FN rate. Identifying FPs in advance is an alternative solution, but it is likely that many of the FPs associated with any given subset of hybridized TAGs will overlap with signals from other subsets of TAGs, shielding them from identification. To address this problem, we have partitioned the heterozygous diploid strain collection into a series of subpools. Hybridization of TAGs derived from each subpool to separate microarrays should yield more definitive information on which microarray features are most prone to cross-hybridization. Until then, FPs are best interpreted probabilistically and should be considered in any inferences made with these microarrays.
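The subpool strategy can be illustrated with a minimal sketch, assuming that each hybridization is summarized as the set of features called present. Any feature that gives signal although its own TAG was absent from the hybridized subpool is a candidate cross-hybridizer. The function name, the data structures and the toy TAG names are hypothetical and are not part of the protocol described here.

```python
from collections import Counter

def cross_hyb_candidates(subpools, detected):
    """Count, per feature, how often it gave signal without its own TAG.

    subpools: dict mapping subpool name -> set of TAGs present in that subpool
    detected: dict mapping subpool name -> set of features called present
              on the microarray hybridized with that subpool
    (Both structures are illustrative assumptions.)
    """
    counts = Counter()
    for name, tags_present in subpools.items():
        # Signal at a feature whose TAG was not in the subpool is unexpected.
        for feature in detected[name] - tags_present:
            counts[feature] += 1
    return counts

# Toy example with made-up TAG names.
subpools = {
    "pool_A": {"tag1", "tag2"},
    "pool_B": {"tag3", "tag4"},
    "pool_C": {"tag5", "tag6"},
}
detected = {
    "pool_A": {"tag1", "tag2", "tag5"},  # tag5 feature lights up unexpectedly
    "pool_B": {"tag3", "tag4", "tag5"},  # and again here
    "pool_C": {"tag5", "tag6", "tag1"},  # tag1 feature lights up unexpectedly
}
candidates = cross_hyb_candidates(subpools, detected)
print(candidates.most_common())
```

Features that recur across several subpools (here the hypothetical `tag5`) would be the first to treat with suspicion in later experiments.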
We made a concerted effort to lower the FN rate through procedures aimed at improving signal strength, e.g. asymmetric PCR and precautions against oxidation. It was not practical to perform the extensive range of control experiments that would be needed to define which, if any, of these modifications were in fact effective. Nevertheless, our experience has been that with the new protocols, the signal intensities are stronger and more reliable. Even so, TAG-associated mutations impose a lower bound on the FN rate. Assuming that these mutations will not be remediated in the near future, the best approach may be to combine the most trustworthy information from both UPTAGs and DNTAGs into a single statistic. We and others have compared different ways to accomplish this and have devised empirical procedures that maximize predictive value (B.D. Peyser, R.A. Irizarry, C. Tiffany, O. Chen, D.S. Yuan, J.D. Boeke and F.A. Spencer, manuscript submitted). Assuming that UPTAG and DNTAG data are statistically independent, overall FN rates will be 2.2–3.2%, corresponding to a lack of informative data for 140–200 genes. This estimate is consistent with that obtained by Eason et al.
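The independence assumption behind this estimate amounts to multiplying the two per-TAG FN rates, since a gene lacks informative data only if both its UPTAG and its DNTAG fail. The sketch below shows the arithmetic. The per-TAG rates and the gene count used in the example are hypothetical placeholders chosen for illustration, not values reported here.

```python
def combined_fn_rate(fn_uptag, fn_dntag):
    """FN rate when a gene is missed only if both TAGs fail (independence assumed)."""
    return fn_uptag * fn_dntag

def genes_without_data(fn_rate, n_genes):
    """Expected number of genes with no informative TAG."""
    return fn_rate * n_genes

# Hypothetical inputs, for illustration only.
fn_up, fn_dn = 0.15, 0.18
n_genes = 6000

overall = combined_fn_rate(fn_up, fn_dn)
print(f"overall FN rate: {overall:.3f}")
print(f"genes without data: {genes_without_data(overall, n_genes):.0f}")
```

With these placeholder inputs the combined rate is 2.7%, or about 162 genes, which falls inside the ranges quoted above. If the two TAGs of a gene tend to fail together, the true combined rate would be higher than the product.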
Knowledge of the FPs and FNs of a microarray is essential for informed data interpretation. However, TAG microarrays will typically be used in two-color experiments in which the labeled samples are derived from two comparable pools of yeast strains. The focus of such experiments is the log ratio of the two signal intensities, and knowledge of the distribution of these log ratios is key to identifying the log ratios that are statistically significantly different. We have learned that although FP and FN errors contribute to this distribution, random variability between the two pools is probably more important. This variability depends strongly on how the pools were sampled for measurement as well as on various sources of noise within the signal intensities themselves. This information can only be obtained from dedicated control experiments that closely reproduce typical experimental conditions. For comparisons between a control pool and an experimental pool with just a few missing strains, pilot studies that compare a control pool with a ‘drop-out’ experimental pool (the complement of a ‘spike-in’ pool) may be ideal for this purpose (B.D. Peyser, R.A. Irizarry, C. Tiffany, O. Chen, D.S. Yuan, J.D. Boeke and F.A. Spencer, manuscript submitted).
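One way to use such control experiments is to take the log ratios from a control-versus-control hybridization as an empirical null distribution, and to call an experimental log ratio significant only if it falls outside the central bulk of that distribution. The following is a minimal sketch under that assumption. The function names, the choice of tail fraction and the simulated intensities are hypothetical, and this is not the procedure of the submitted manuscript cited above.

```python
import math
import random

def log_ratios(sample_a, sample_b):
    """log2(a / b) per feature; intensities are assumed to be positive."""
    return [math.log2(a / b) for a, b in zip(sample_a, sample_b)]

def empirical_cutoffs(null_ratios, tail=0.025):
    """Lower and upper cutoffs taken from the sorted null log ratios."""
    ordered = sorted(null_ratios)
    n = len(ordered)
    lo_idx = math.floor(tail * (n - 1))
    hi_idx = math.ceil((1 - tail) * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]

def significant(ratios, cutoffs):
    """Indices of features whose log ratio lies outside the null cutoffs."""
    lo, hi = cutoffs
    return [i for i, r in enumerate(ratios) if r < lo or r > hi]

# Simulated control-versus-control hybridization (made-up noise level).
random.seed(0)
control_1 = [1000.0] * 200
control_2 = [1000.0 * 2 ** random.gauss(0, 0.3) for _ in range(200)]
cut = empirical_cutoffs(log_ratios(control_1, control_2))

# Simulated 'drop-out' comparison: feature 0 is 16-fold weaker in the experiment.
reference = [1000.0] * 5
experiment = [1000.0 / 16] + [1000.0] * 4
hits = significant(log_ratios(reference, experiment), cut)
print(cut, hits)
```

Because the cutoffs come from the control data themselves, they reflect pool-to-pool variability as well as measurement noise, which a fixed ‘2 SD’ rule applied to a single hybridization would not capture.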
We add for completeness that the 5-fold replicate features designed into our microarray have a novel application beyond their use as negative or positive (YQL) controls. The novelty lies not so much in the paradigm of calculating standard errors from ‘n = 5’ replicates as in the more powerful idea of estimating and correcting systematic errors that take the form of irregular biases over the surface of the microarray. Such biases have many potential causes, ranging from manufacturing defects to fingerprints to temperature gradients. Because our replicates are intimately commingled with the systematic set of TAGs and yet are in random order, they are well suited to serve as probes of these biases. We have recently developed software to estimate these biases from replicate data and have found that they account almost entirely for the spatially correlated errors in these microarrays. A statistical analysis of these errors will be presented elsewhere.
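The idea of using scattered replicates as probes of surface bias can be sketched as follows. Each replicate's deviation from the median of its own replicate set is treated as a local sample of the bias, and the bias at any other position is estimated by inverse-distance weighting of those samples. This is an illustrative stand-in, not the software mentioned above. The data layout, the function names and the weighting scheme are all assumptions.

```python
from statistics import median

def replicate_residuals(replicates):
    """replicates: dict probe_id -> list of (row, col, log_intensity).

    Returns a list of (row, col, residual), where residual is the deviation
    of each replicate from the median of its own replicate set.
    """
    residuals = []
    for spots in replicates.values():
        centre = median(value for _, _, value in spots)
        residuals.extend((r, c, v - centre) for r, c, v in spots)
    return residuals

def estimate_bias(row, col, residuals, power=2.0):
    """Inverse-distance-weighted average of replicate residuals at (row, col)."""
    num = den = 0.0
    for r, c, res in residuals:
        d2 = (r - row) ** 2 + (c - col) ** 2
        if d2 == 0:
            return res  # a replicate sits exactly at this position
        weight = 1.0 / d2 ** (power / 2)
        num += weight * res
        den += weight
    return num / den

def correct(row, col, value, residuals):
    """Subtract the estimated local bias from a measured log intensity."""
    return value - estimate_bias(row, col, residuals)

# Toy array with a made-up left-to-right gradient of 0.01 log units per column.
positions = [(3, 0), (7, 25), (1, 50), (9, 75), (5, 100)]
replicates = {"probe_X": [(r, c, 10.0 + 0.01 * c) for r, c in positions]}
residuals = replicate_residuals(replicates)

print(estimate_bias(5, 10, residuals))  # left side of the array
print(estimate_bias(5, 90, residuals))  # right side of the array
```

Since residuals are taken relative to each set's median, this sketch recovers the bias only up to a constant offset. That is sufficient for removing spatially correlated error from log ratios.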