There are many options in handling microarray data that can affect study conclusions, sometimes drastically. Working with a two-color platform, this study uses ten spike-in microarray experiments to evaluate the relative effectiveness of some of these options for the experimental goal of detecting differential expression. We consider two data transformations, background subtraction and intensity normalization, as well as six different statistics for detecting differentially expressed genes. Findings support the use of an intensity-based normalization procedure and also indicate that local background subtraction can be detrimental for effectively detecting differential expression. We also verify that robust statistics outperform t-statistics in identifying differentially expressed genes when there are few replicates. Finally, we find that choice of image analysis software can also substantially influence experimental conclusions.
DNA microarrays are a widely used tool in molecular biology. The technology has advanced considerably in the past few years, and a single microarray now commonly provides measurements on ten or twenty thousand genes. Despite great interest in microarrays and associated bioinformatic problems, some fundamental issues remain concerning the best way to analyze microarray data. The nature of error and other sources of variation in these data remain poorly understood and difficult to characterize. These problems are exacerbated by the fact that a microarray study typically contains few biological or technical replicates. Experience suggests that, while theoretical considerations are important, methodological development should be guided by empirical findings.
There is a lack of realistic, empirical validation of methods for the analysis of microarray data (1). Methodologies are typically introduced by examining their performance in real microarray experiments (2–4) or simulated data (4,5). Neither approach is entirely satisfactory. Simulation models are inherently dubious because they often rely on distributional assumptions or idealized models for the error structure. On the other hand, ‘truth’ is unknown with data from an actual microarray experiment, so it is impossible to determine whether a given methodology does better at revealing the right answer. In this paper, we compare different methodologies on real microarray datasets where the two biological samples are identical except for a few known spike-in genes. We put methodologies to the test on real data where the ‘truth’ is known by experimental design. Spike-in studies like those we consider are not a panacea (1), for reasons we discuss later. However, they go a long way towards filling a serious void.
Here, we study analytical methodologies with spike-in experiments using one of the simplest microarray experimental designs, the dye-swap (6). A common experimental objective for dye-swap data is to identify genes that are differentially expressed between two RNAs. In the context of the problem of detecting differential expression, this study has three specific goals: (i) evaluate the effectiveness of a commonly used intensity normalization procedure; (ii) evaluate the effectiveness of subtracting local background; and (iii) assess the performance of different ‘ranking statistics’ for selecting genes with the strongest evidence for differential expression. As a secondary goal, we evaluate the performance of different image analysis programs, although this part of our study is less comprehensive due to limited data.
The data for this study are ten ‘spike-in’ experiments conducted by six different laboratories within the Toxicogenomics Research Consortium (www.niehs.nih.gov/dert/trc/). The RNA from mouse liver tissue was divided into two aliquots (say, RNA1 and RNA2), with 10 Arabidopsis genes ‘spiked in’ at known relative concentrations. All 10 experiments used the same physical RNA preparation. Of the 10 spike-in genes, four were spiked in at equal concentrations, three were spiked in at a 1:3 ratio and three were spiked in at a 3:1 ratio. Ideally, every hybridized microarray should yield three ratios of 3, three ratios of 1/3, and all remaining ratios should be equal to 1. These datasets are analogous to (though much simpler than) the ‘Latin Square’ dataset that has proved extremely valuable in developing methodology for Affymetrix® arrays (7).
In this paper, an ‘experiment’ is a set of four hybridized arrays in a ‘double dye-swap’ arrangement. There were ten experiments, but some experiments were analyzed with two or three image analysis programs, resulting in a total of 18 ‘datasets’. Table 1 summarizes the datasets available and serves as a guide to the organization of the Results. We note that image analyses performed with GenePix® for datasets E–G were conducted at a contract company, and the image analyses performed with the SPOT program (http://www.cmis.csiro.au/iap/Spot/spotmanual.htm) were conducted by a single data analyst at Duke University. All other image analyses were conducted by the home laboratory.
Some experiments were performed with ‘standard arrays’, others with ‘alternative arrays’. The standard arrays were produced for the consortium at Duke University, although one set (Experiment B) was from a different production run. The alternative arrays were produced or acquired by the home institutions. The standard arrays contained probes for 16909 genes. The alternative arrays contained probes for 7660, 27020, or 17167 genes.
Each spot on each array produces two fluorescent intensity measurements, a measurement for RNA1 and a measurement for RNA2. The ratio of these two numbers nominally measures the relative abundance of the corresponding transcript in the two RNAs. With the ‘double dye-swap’ experimental design, there are four ratios for each gene. For comparing RNA1 to RNA2, there are two red/green ratios and two green/red ratios. In this study, we worked with the log2-ratios.
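To make the double dye-swap bookkeeping concrete, here is a minimal sketch of how the four log2(RNA1/RNA2) ratios per gene can be assembled; the function name and array layout are our own and purely illustrative, with the sign flipped on the dye-swapped arrays so every row measures the same quantity:

```python
import numpy as np

def dye_swap_log_ratios(red, green, swapped):
    """Return log2(RNA1/RNA2) for each spot on each array.

    red, green : (n_arrays, n_genes) fluorescent intensities.
    swapped    : per-array booleans, True when the dye assignment was
                 reversed (the dye-swap arrays).
    """
    ratios = np.log2(red / green)      # log2(red/green) per spot
    ratios[np.asarray(swapped)] *= -1  # flip sign on dye-swap arrays
    return ratios                      # every row now measures log2(RNA1/RNA2)
```

With a double dye-swap, passing four arrays (two with `swapped=False`, two with `swapped=True`) yields the four comparable log ratios per gene used throughout this study.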
Image analysis programs for microarrays give some measurement of the nearby background fluorescence in addition to the spot fluorescence. A common (though not universal) practice is to subtract the local background measurements from the spot intensity measurements to get the starting data. The intention is to ‘correct’ for the contribution of background fluorescence inside the spot. This can change the data substantially, yet there has been no thorough evaluation of whether the procedure improves accuracy. Figure 1 shows a view of the data from one microarray in our study with and without background adjustment. In this study, we evaluate background adjustment with respect to the experimental objective of detecting differential expression.
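As an illustration of the procedure under evaluation, the following sketch (a hypothetical helper function, not any program's actual implementation) subtracts local background from the spot foreground and truncates negative results to a floor of 1, mirroring the ad hoc workaround for negative intensities discussed later:

```python
import numpy as np

def background_adjust(foreground, background, floor=1.0):
    """Subtract local background from spot foreground intensities.

    Negative adjusted values are truncated to `floor` so that
    downstream log transformations remain defined.
    """
    adjusted = foreground - background
    return np.maximum(adjusted, floor)
```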
A popular pre-processing step for two-color array data is a normalization procedure to make the median log-ratio independent of the spot intensity. The basic procedure was introduced by Yang et al. (2) and employs the statistical function ‘loess’. Such procedures always improve the appearance of the data, but this is not surprising because they are designed to do so. As with background adjustment, it has not been tested whether intensity normalization improves accuracy in estimation or detection. In this study, we evaluate intensity normalization with respect to the experimental objective of detecting differential expression.
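The procedure of Yang et al. fits a smooth ‘loess’ trend to the log-ratios as a function of average log intensity and subtracts it. As a rough, self-contained stand-in (a running median rather than an actual loess fit; the function name and window size are our own illustrative choices), the idea can be sketched as:

```python
import numpy as np

def intensity_normalize(M, A, window=101):
    """Remove the intensity-dependent trend from log-ratios.

    M : log-ratios, A : average log intensities, one value per spot.
    A running median of M in A-order stands in for the loess fit;
    subtracting it makes the typical log-ratio 0 at every intensity.
    """
    order = np.argsort(A)
    M_sorted = M[order]
    trend = np.empty_like(M)
    half = window // 2
    for i in range(len(M)):
        lo, hi = max(0, i - half), min(len(M), i + half + 1)
        trend[order[i]] = np.median(M_sorted[lo:hi])
    return M - trend
```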
Evaluating normalization and background adjustment (BA) leads one to consider four different versions of each dataset: with and without background adjustment, and with and without intensity-normalization. When intensity normalization was not applied, log ratios were centered, by subtraction, to set the median log ratio on each array equal to 0. ‘Normalization’ used herein always refers to intensity normalization.
Analysis of differential expression in a microarray study typically employs a statistic that transforms multiple measurements into a single number that is then used to gauge the evidence for differential expression. We consider six such statistics: the mean, median, t-statistic, S-statistic [related to the SAM software (3)], B-statistic, and BL-statistic. Table 2 gives definitions and references for these statistics. For further discussion of these statistics, consult the review by Cui et al. (8).
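For concreteness, here is a sketch of how three of these statistics (the mean, the median, and the ordinary t-statistic) are computed from replicate log-ratios; the S-, B-, and BL-statistics require tuning constants or model fits and are omitted from this illustration:

```python
import numpy as np

def ranking_statistics(log_ratios):
    """Compute per-gene ranking statistics from a
    (n_replicates, n_genes) matrix of log-ratios.
    """
    mean = log_ratios.mean(axis=0)
    median = np.median(log_ratios, axis=0)
    n = log_ratios.shape[0]
    se = log_ratios.std(axis=0, ddof=1) / np.sqrt(n)  # standard error
    t = mean / se                                     # one-sample t-statistic
    return {"mean": mean, "median": median, "t": t}
```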
In order to avoid dealing with p-values and choosing significance levels, we compare the performance of these six statistics without performing formal hypothesis tests. Instead, we consider the ranking of the genes induced by each statistic, which allows us to evaluate the essential ability of a statistic to discriminate differentially expressed genes. Since all genes on the arrays except the six differentially spiked-in genes are expected to have zero log ratios on average, an effective statistic should rank those six genes as having the most extreme absolute values (ranks 1 through 6). Since we do not perform statistical testing, we refer to the six statistics we study as ‘ranking statistics’.
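This rank-based evaluation can be sketched as follows (an illustrative function of our own; rank 1 denotes the gene with the most extreme absolute statistic):

```python
import numpy as np

def spike_in_ranks(statistic, spike_idx):
    """Rank genes by |statistic| (rank 1 = most extreme) and return
    the ranks of the spike-in genes; a perfect statistic gives 1-6.
    """
    order = np.argsort(-np.abs(statistic))       # most extreme first
    ranks = np.empty(len(statistic), dtype=int)
    ranks[order] = np.arange(1, len(statistic) + 1)
    return ranks[spike_idx]
```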
Intensity normalization produced a small but clear improvement in the ability to detect differential expression for all statistics. See Figures 2 and 3, which show results for Experiments D and C. Intensity normalization improved the performance of all six ranking statistics on the data with or without BA.
In general, background adjustment markedly worsened the ability to detect differential expression. We observed a deleterious effect of BA for datasets A–D and F, regardless of whether intensity normalization was performed. Consider dataset D for an illustration (Figure 2). Using the mean as the ranking statistic, for example, the ranks of the six spike-in genes increased from under 8 to over 1800 after background adjustment in the absence of intensity normalization. Except for the t-statistic, all ranking statistics performed better on the versions of Dataset D without BA. Figure 3 shows similar results for Dataset C. Results for every dataset can be found in Supplementary Figure 7.
Plots such as Figure 1 or other data summaries (Figure 6) show that background adjustment dramatically increased the variability for most genes, thereby increasing the denominators of the t-statistics. However, BA barely affected the variability of the spike-in genes, possibly because these tended to have higher intensity. The net effect was that BA improved the performance of the t-statistic. In dataset D without BA, the t-statistic was the worst-performing ranking statistic, but after BA it performed the best. In dataset D with BA and normalization, the t-statistic actually ranked the six spike-in genes as the six most extreme, the ideal situation. However, on the other datasets in this group, foregoing BA and using a different ranking statistic allowed much better detection of differentially expressed genes (see Supplementary Figure 7).
For dataset E, BA offered some improvement in the ranks of the six spike-in genes. But the quality of dataset E was extremely poor overall. In addition, data from two of the four arrays comprising dataset E were of noticeably lower quality. (We later learned that these two arrays were scanned with a different scanner.) When we analyzed only the data from the two better-quality arrays, BA was again deleterious in detecting the six spike-in genes. For dataset G, BA slightly improved detection of the spike-ins in general, although again results were poor overall, as with dataset E.
Based on an overall assessment of the results described above, we concluded that the most generally favorable versions of these seven datasets for detecting differentially expressed genes were the normalized versions without BA. We therefore compare the performance of the six ranking statistics only on the normalized data without BA. Two of these seven datasets (A and D) were very ‘clean’ and the median, S-, B-, and BL-statistics were all able to discriminate the spike-in genes. In contrast, the spike-in genes were difficult to discriminate in datasets B and E; here the median performed the best, followed by the mean, BL-, S-, and B-statistics. Our data did not indicate a clear ‘winner’ among the test statistics. However, an unambiguous conclusion for datasets A–F (normalized, no BA) was that the t-statistic performed poorly. Table 3A summarizes these results.
Background values produced by SPOT were small compared to foreground and also compared to the background values produced by the other image software. (We used the ‘Morph’ measure of background for SPOT, as recommended by the software's authors.) In most cases, BA with SPOT slightly worsened the ability to detect differentially expressed genes, but in a few cases detection was slightly improved (see Figures 4 and 5 and Supplementary Figure 8).
As before, we observed that intensity normalization offered a small but clear improvement in the ability to detect differential expression for all statistics.
In these datasets produced by SPOT, all statistics except the t-statistic performed well on datasets C and D; the mean, median, or BL-statistic performed the best on datasets B, E, and G. Except for dataset F, the t-statistic performed dramatically worse than the other statistics. The poor overall quality of dataset F made the relatively good performance of the t-statistic there unimpressive. Table 3B summarizes the performance of the six ranking statistics on the SPOT datasets. Although these results on the ranking statistics are based on the normalized data without BA, findings change little if based on the BA data due to the small effect of BA for SPOT data.
Experiment E was also analyzed with ArraySuite, and Experiment F with QuantArray. For dataset E (ArraySuite), we again saw a deleterious effect of BA for detecting differentially expressed genes. For dataset F (QuantArray), BA had little effect, but results were altogether poor. As before, we observed that intensity normalization improved the ability to detect differential expression.
No ranking statistic consistently performed best, but the t-statistic performed the worst in all versions of these two datasets (see Supplementary Figure 9). Table 3C summarizes the results for the normalized data without BA.
Finally, we applied our analysis to datasets H, I and J, each of which used a different source for microarrays than Experiments A–G. Unlike the standard arrays, the alternative arrays did not necessarily contain spots for all ten spike-in genes. In addition, there were varying numbers of spots per array for the spike-in genes. There were 2, 48 and 32 spots for each included spike-in gene in datasets H, I and J, respectively. We present results from an analysis in which we considered every spot for a spike-in gene to be a different gene. In other words, we computed the ranks for each spot separately.
This analysis of Experiments H and I produced the same general conclusion as the analyses of Experiments A–G (see Supplementary Figure 11). The t-statistic performed worst among the six statistics, while the other five statistics performed comparably in detecting differential expression. The most favorable version of these two datasets for detecting differential expression was the version without BA and with intensity normalization. While the BA data provided better detection in Experiment J, detection ability was poor in every version of Dataset J. The other ranking statistics performed better than the t-statistic in the normalized version of Dataset J, with or without BA. On the whole, these additional datasets supported the generality of our findings.
Experiments B–G provided additional opportunities to evaluate the relative performance of image analysis programs (Table 3). All six of these experiments were analyzed with both GenePix® and SPOT. For Experiments B and F, SPOT offered slightly worse results, but for Experiments C and D SPOT offered slightly improved results. For dataset E, SPOT offered markedly improved results. Figures are provided in the Supplementary Material.
Experiments E and F were additionally analyzed with a third program. For Experiment E (four arrays), ArraySuite performed better than GenePix®, but not as well as SPOT. When only the two better arrays for Experiment E were analyzed, ArraySuite performed worse than both GenePix® and SPOT. For Experiment F, QuantArray performed worse than GenePix® and SPOT without normalization but better than GenePix® and SPOT with normalization. While results are suggestive, we caution that these comparisons are made on limited data of questionable quality.
There are some limitations to our study that warrant caution in generalizing the results. First, there were only six spike-in genes with differential ‘expression’ among a total of nearly 17000 genes in each dataset involving the standard arrays. In particular, these data are well suited to the assumptions underlying the intensity-normalization procedure. In real data, there will typically be more differential expression, and genes will be differentially expressed to varying degrees. We also note that in most datasets, though not all, the six differentially ‘expressed’ genes had medium or high signal intensity. It is difficult to predict how our results might have been different if we were attempting to detect differential expression among low-intensity genes. Finally, this study involved experiments with only four (technical) replicates. The results concerning the relative performance of the ranking statistics might well have been different with a larger sample size; in particular, the t-test might have performed better. On the other hand, experiments with few replicates are not unusual with microarrays, so our results certainly have some relevance to current scientific practice.
A drawback to local background adjustment is that it often results in negative intensities. Since negative intensities do not make sense, this in itself suggests a flaw in the procedure (9). In our analyses, we arbitrarily set negative values equal to 1, a common ad hoc workaround. In datasets produced by GenePix® or ArraySuite, typically 3000 out of nearly 17000 genes were affected by this truncation. Foregoing background adjustment has the additional advantage of avoiding this problem altogether.
Another data analysis procedure, practiced in various forms, is to use an intensity-dependent selection method [Kerr et al. (10) is one example]. That is, the threshold for selecting differentially expressed genes varies with spot intensity. We investigated whether such a selection method would change our conclusions, and in particular whether it would improve the performance of the t-statistic. We evaluated this possibility by examining plots of the t-statistic against average log intensity (see Supplementary Figure 11). We found that an intensity-based selection method offered some improvement for the t-statistic, but could not match the performance of more effective ranking statistics.
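One simple form of such a selection rule (our own illustrative variant, not the specific method of Kerr et al.) applies a bin-specific threshold to the absolute statistic within intensity bins:

```python
import numpy as np

def intensity_dependent_select(statistic, A, n_bins=10, quantile=0.99):
    """Flag genes whose |statistic| exceeds a threshold that varies
    with average log intensity A: within each intensity bin, keep
    genes above that bin's empirical quantile of |statistic|.
    """
    abs_stat = np.abs(statistic)
    edges = np.quantile(A, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, A, side="right") - 1, 0, n_bins - 1)
    selected = np.zeros(len(statistic), dtype=bool)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            selected[mask] = abs_stat[mask] > np.quantile(abs_stat[mask], quantile)
    return selected
```

Because each bin supplies its own threshold, a moderately extreme statistic at low intensity can be selected while the same value at high intensity is not, which is the essence of intensity-dependent selection.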
An additional set of questions one can envisage asking of these spike-in data is how normalization and BA affect estimates of relative gene expression in terms of variance and bias. That is, one could approach an analysis from the perspective of ‘estimation’ rather than ‘detection’. The estimation and detection problems are related, yet distinct (11). A simple analysis regarding estimation revealed a bias-variance trade-off for using BA, which was also found by Bolstad et al. (7) with Affymetrix® data. The variance of measurements is larger in the data with BA, but the bias in measurements is larger in the data without BA (Figure (Figure6).6). A simple summary is that ‘subtracting background reduces bias but increases variance’. However, measurement bias may be acceptable, even if it is large, if measured values are a consistent proportion of true values. Unfortunately, these spike-in datasets could not inform us on the consistency of the bias because there were only six differentially expressed genes at only two different levels in these experiments. This limitation of the experimental design is the reason we focused on the detection question in this study.
In this paper, we only consider the ranking of the genes induced by the various statistics. We do not address the question of how to choose a threshold to give acceptable type I and type II error rates. Our goal here is to identify the most promising statistics, but issues remain concerning how to apply them.
It is clear from our results that the datasets used in this study were of varying quality. This is despite the fact that all the assays used the same RNA, and most of them used the same source for microarrays (the standard arrays). This illustrates that there are numerous sources of variation in microarray experiments beyond obvious sources such as RNA and the physical microarrays. Some other sources of variation are differences in hybridization protocols, which are detailed in ‘Toxicogenomics Research Consortium: Standardizing Gene Expression Between Laboratories and Across Platforms’ (Members of the Toxicogenomics Consortium, manuscript submitted). An important and interesting question is to identify the protocol differences that account for the observed differences in data quality. However, this question is outside the methodological focus of this paper. In any case, it is certainly reasonable to prefer conclusions that are based only on the high-quality datasets, such as Experiments A and D. However, we note that our primary conclusions concerning the merits of intensity normalization, background adjustment, and different ranking statistics are largely consistent across the various datasets. We believe this supports the robustness of our conclusions.
Therefore, while we advise caution in extrapolating from our results, our primary findings can be summarized as follows: (i) intensity normalization offered a small but consistent improvement in detecting differential expression; (ii) local background subtraction was generally detrimental to detection; (iii) robust ranking statistics outperformed the t-statistic when replicates were few; and (iv) the choice of image analysis software can substantially influence experimental conclusions.
Supplementary Material is available at NAR Online.
We thank two reviewers whose helpful comments led to substantial improvements in this paper. We also acknowledge Pat Hurban of Icoria for his assistance. This work was supported by grants from the National Institute of Environmental Health Sciences (U19ES11387, U19ES11375, U19ES011384, U19ES011391 and U19ES11399) and by the University of Washington Department of Biostatistics Career Development Funds to K.F.K.