There are some limitations to our study that warrant caution in generalizing the results. First, there were only six spike-in genes with differential ‘expression’ among a total of nearly 17 000 genes in each dataset involving the standard arrays. As such, these data are especially well suited to the assumptions underlying the intensity-normalization procedure. In real data, there will typically be more differential expression, and genes will be differentially expressed to varying degrees. We also note that in most datasets, though not all, the six differentially ‘expressed’ genes had medium or high signal intensity. It is difficult to predict how our results might have differed had we been attempting to detect differential expression among low-intensity genes. Finally, this study involved experiments with only four (technical) replicates. The results concerning the relative performance of the ranking statistics might have been different with a larger sample size; in particular, the t-test might have performed better. On the other hand, experiments with few replicates are not unusual with microarrays, so our results certainly have some relevance to current scientific practice.
A drawback to local background adjustment is that it often results in negative intensities. Since negative intensities do not make sense, this in itself suggests a flaw in the procedure (9). In our analyses, we arbitrarily set negative values equal to 1, a common ad hoc workaround. In datasets produced by GenePix® or ArraySuite, typically 3000 out of nearly 17 000 genes were affected by this truncation. Forgoing background adjustment has the additional advantage of avoiding this problem altogether.
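To make the truncation step concrete, the following is a minimal sketch (with hypothetical function and variable names, not code from our analysis) of local background subtraction followed by flooring negative values at 1:

```python
import numpy as np

def background_adjust(foreground, background, floor=1.0):
    """Subtract local background estimates from spot foreground intensities.

    Spots whose background estimate exceeds the foreground signal yield
    non-positive adjusted intensities; these are floored at `floor` so that
    a subsequent log transform is defined.  Returns the adjusted values and
    the number of spots affected by the truncation.
    """
    adjusted = np.asarray(foreground, dtype=float) - np.asarray(background, dtype=float)
    n_truncated = int(np.sum(adjusted <= 0))  # spots affected by the floor
    adjusted = np.maximum(adjusted, floor)
    return adjusted, n_truncated

# Toy example: the third spot has a background estimate above its foreground.
fg = [500.0, 1200.0, 80.0]
bg = [120.0, 300.0, 150.0]
adj, n_trunc = background_adjust(fg, bg)
print(adj.tolist(), n_trunc)  # [380.0, 900.0, 1.0] 1
```

In a real dataset the counterpart of `n_trunc` is the roughly 3000 genes per array mentioned above.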
Another data analysis procedure, practiced in various forms, is to use an intensity-dependent selection method [Kerr et al. is one example]. That is, the threshold for selecting differentially expressed genes varies with spot intensity. We investigated whether such a selection method would change our conclusions, and in particular whether it would improve the performance of the t-statistic. We evaluated this possibility by examining plots of the t-statistic against average log intensity (see Supplementary Figure 11). We found that an intensity-based selection method offered some improvement for the t-statistic, but could not match the performance of the more effective ranking statistics.
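One common variant of intensity-dependent selection bins genes by average log intensity and applies a separate cutoff within each bin. The sketch below is an illustrative implementation of that general idea, not the specific procedure of Kerr et al.; the function name, binning scheme, and quantile-based cutoff are all assumptions made for the example:

```python
import numpy as np

def intensity_dependent_cutoffs(t_stats, avg_log_intensity, n_bins=10, quantile=0.99):
    """Select genes using a per-bin threshold on a ranking statistic.

    Genes are grouped into equal-count bins of average log intensity; within
    each bin the cutoff is the chosen quantile of |t|, so the selection
    threshold varies with spot intensity instead of being global.
    Returns a boolean mask of selected genes.
    """
    t_stats = np.asarray(t_stats, dtype=float)
    a = np.asarray(avg_log_intensity, dtype=float)
    edges = np.quantile(a, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, a, side="right") - 1, 0, n_bins - 1)
    selected = np.zeros(t_stats.shape, dtype=bool)
    for b in range(n_bins):
        in_bin = bins == b
        if not np.any(in_bin):
            continue
        cutoff = np.quantile(np.abs(t_stats[in_bin]), quantile)
        selected |= in_bin & (np.abs(t_stats) >= cutoff)
    return selected

# Usage: select the most extreme statistics within each intensity bin.
t = np.arange(100.0)   # pretend t-statistics
a = np.arange(100.0)   # pretend average log intensities
mask = intensity_dependent_cutoffs(t, a, n_bins=2, quantile=0.9)
```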
An additional set of questions one could ask of these spike-in data is how normalization and BA affect estimates of relative gene expression in terms of variance and bias. That is, one could approach an analysis from the perspective of ‘estimation’ rather than ‘detection’. The estimation and detection problems are related, yet distinct (11). A simple analysis regarding estimation revealed a bias-variance trade-off in using BA, which was also found by Bolstad et al. with Affymetrix® data. The variance of measurements is larger in the data with BA, but the bias in measurements is larger in the data without BA (Figure ). A simple summary is that ‘subtracting background reduces bias but increases variance’. However, measurement bias may be acceptable, even if it is large, provided that measured values are a consistent proportion of true values. Unfortunately, these spike-in datasets could not inform us on the consistency of the bias because they contained only six differentially expressed genes at only two different levels. This limitation of the experimental design is the reason we focused on the detection question in this study.
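The trade-off can be illustrated with a toy simulation under a deliberately simplified measurement model (additive background, independent Gaussian noise on the foreground and background estimates; the numbers are arbitrary and not from our data):

```python
import numpy as np

rng = np.random.default_rng(0)
true_signal, true_background, sigma = 1000.0, 400.0, 50.0
n = 100_000

# Observed foreground and the local background estimate each carry
# independent measurement noise (a simplification, for illustration only).
fg = true_signal + true_background + rng.normal(0.0, sigma, n)
bg = true_background + rng.normal(0.0, sigma, n)

with_ba = fg - bg   # background-adjusted intensities
without_ba = fg     # unadjusted intensities

bias_with = with_ba.mean() - true_signal        # near 0: BA removes the bias
bias_without = without_ba.mean() - true_signal  # near true_background
var_ratio = with_ba.var() / without_ba.var()    # near 2: noise variances add
```

Subtracting two noisy quantities removes the additive bias but sums their noise variances, which is exactly the ‘reduces bias but increases variance’ summary above.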
In this paper, we only consider the ranking of the genes induced by the various statistics. We do not address the question of how to choose a threshold to give acceptable type I and type II error rates. Our goal here is to identify the most promising statistics, but issues remain concerning how to apply them.
It is clear from our results that the datasets used in this study were of varying quality, despite the fact that all the assays used the same RNA and most used the same source of microarrays (the standard arrays). This illustrates that microarray experiments have numerous sources of variation beyond the obvious ones, such as the RNA and the physical arrays. Among these are differences in hybridization protocols, which are detailed in ‘Toxicogenomics Research Consortium: Standardizing Gene Expression Between Laboratories and Across Platforms’ (Members of the Toxicogenomics Consortium, manuscript submitted). An important and interesting question is which protocol differences account for the observed differences in data quality; however, this question is outside the methodological focus of this paper. In any case, it is certainly reasonable to prefer conclusions based only on the high-quality datasets, such as experiments D and A. We note, however, that our primary conclusions concerning the merits of intensity normalization, background adjustment, and the different ranking statistics are largely consistent across the various datasets, which supports the robustness of our conclusions.
Therefore, while we advise caution in extrapolating from our results, here is a summary of our primary findings.
1. Background adjustment tended to increase the variability of gene expression data and made it harder to detect differentially expressed genes. The severity of this detrimental effect depended on the image analysis program: with the SPOT program, subtracting background had only a minor effect on the data, and in a few cases it slightly improved the detection of differentially expressed genes.
2. Intensity normalization produced a small but clear improvement in the ability to detect differential expression. This was especially true if background adjustment had been applied, although BA is not recommended due to finding (1).
3. The most generally favorable conditions for identifying differentially expressed genes were to apply intensity normalization without background adjustment and to use a robust alternative to the t-statistic, such as the S-, B- or BL-statistic or the median.
4. Our highest-quality datasets could not discriminate among the mean, median, S-, B- and BL-statistics, all of which distinguished the differentially ‘expressed’ spike-in genes. There was no consistent ‘winner’ among these five statistics across datasets.
5. In three of the five experiments that were analyzed with two different image analysis packages, SPOT offered some improvement over GenePix® in the ability to detect differentially expressed genes.
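To make the contrast between the t-statistic and a robust alternative concrete, here is a simplified sketch of two ranking statistics applied to per-gene log-ratios across replicate arrays. Only the one-sample t-statistic and the median log-ratio are shown; the S-, B- and BL-statistics are not reproduced here, and the toy data are invented for illustration:

```python
import numpy as np

def t_statistic(log_ratios):
    """One-sample t-statistic across replicate arrays for each gene (rows)."""
    m = log_ratios.mean(axis=1)
    se = log_ratios.std(axis=1, ddof=1) / np.sqrt(log_ratios.shape[1])
    return m / se

def median_statistic(log_ratios):
    """Median log-ratio across replicates: robust to a single-array outlier."""
    return np.median(log_ratios, axis=1)

# Toy data, 4 replicate arrays: gene 0 shows a consistent 2-fold change
# (log2 ratio near 1); gene 1 is null but has one outlying replicate.
lr = np.array([
    [1.0, 1.1, 0.9, 1.0],
    [0.0, 0.1, -0.1, 3.0],
])
t = t_statistic(lr)       # t[1] is modest: the outlier inflates both the
                          # null gene's mean and its standard error
med = median_statistic(lr)  # med[1] stays near 0 despite the outlier
```

With only four replicates, a single aberrant measurement can move a null gene's mean substantially, whereas the median largely ignores it; this is one intuition for why the robust statistics fared well in our comparisons.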