Several preprocessing algorithms have been proposed for Affymetrix arrays, but it remains unclear whether any one of them provides more accurate results than the others. Several important studies attempting to answer this question using artificially produced RNA samples have been a key source of guidance for investigators. Until now, however, there has been a lack of systematic comparisons of the performance of preprocessing algorithms in investigational studies. In this study, we evaluated nine preprocessing algorithms in two complementary data sets that are representative of the microarray experiments we typically generate and analyze in our research. Because the true gene expression levels are not known exactly, we assessed the accuracy of the various preprocessing algorithms by comparing the expression values they generate with those measured independently on RT-PCR arrays.
RT-PCR is often used to “confirm” microarray results because it provides relatively accurate measurements over a wide dynamic range. However, the interpretation of these results can be problematic because RT-PCR expression values vary with the choice of normalization controls. Although our study was designed to avoid standard housekeeping-gene normalization of RT-PCR data, this task remains important for accurate results in actual research situations. No single gene is expressed at a constant level in all biological samples, yet RT-PCR measurements are often normalized to a single gene. Several publications have introduced more rational normalization methods for RT-PCR. A model-based variance estimation approach was introduced to identify the genes with the lowest variance in a given type of data set, which are therefore best suited for normalization. In another approach, the geometric average of multiple control genes was found to be an accurate normalization factor for RT-PCR measurements.
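To make these two normalization strategies concrete, the sketch below implements simplified versions of both in Python: selecting the candidate control gene with the lowest variance of log-expression across samples, and computing a per-sample normalization factor as the geometric mean of several control genes. The function names and example data are our own illustrations, not code from the cited publications.

```python
import numpy as np

def lowest_variance_gene(log_expr, gene_names):
    """Pick the candidate control gene with the smallest variance of
    log-expression across samples (a simplified stand-in for the
    model-based variance estimation approach)."""
    variances = log_expr.var(axis=1)
    return gene_names[int(np.argmin(variances))]

def geometric_mean_factor(rel_quantities):
    """Per-sample normalization factor: the geometric mean of the
    relative quantities of multiple control genes."""
    return np.exp(np.log(rel_quantities).mean(axis=0))

# Hypothetical data: rows = 3 control genes, columns = 6 samples
rng = np.random.default_rng(0)
rel = rng.lognormal(mean=0.0, sigma=0.2, size=(3, 6))

factors = geometric_mean_factor(rel)
normalized = rel / factors  # divide each sample by its factor
```

By construction, dividing each sample by its factor leaves the geometric mean of the control genes at exactly 1 in every sample, which is what makes the multi-gene factor robust to fluctuations in any single control gene.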
Although the accuracy of preprocessing algorithms for Affymetrix microarrays could be compared in numerous ways, we used two performance metrics. First, we used the PCC because it is intuitive and gives a useful measure of concordance under many conditions. Second, we introduced the log-ratio discrepancy (LRD), which we believe is a more useful assessment of expression value accuracy in the context of our biomarker research because it takes into account the impact of genes that do not change their expression across the samples in the cohort. Researchers using microarrays for other types of experiments may find other performance metrics more relevant, and our results should be considered in this context. For example, the LRD penalizes compression-type artifacts, whereas the PCC does not. Such artifacts may be acceptable for simple analyses, such as searching for differentially expressed genes between two relatively homogeneous groups, but they are not acceptable when the response must be linear, e.g. in principal components analysis. An open question is whether the optimal choice of preprocessing algorithm depends on the type of data or on the biological question being asked. For example, some analyses, such as regulatory network reconstruction, may be highly sensitive to random correlations, in which case concordance with RT-PCR measurements may not be the only consideration for selecting an algorithm.
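The contrast between the two metrics can be illustrated with a small numeric sketch. Here we take one plausible definition of a log-ratio discrepancy, the mean absolute difference between the two platforms' log-ratios relative to a reference sample; the exact LRD formula used in our analysis may differ, and all names and data below are illustrative. A compression artifact that halves every log-ratio leaves the PCC at a perfect 1.0 but is clearly penalized by the discrepancy measure.

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between two measurement vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def log_ratio_discrepancy(array_log, pcr_log, ref=0):
    """Mean absolute difference between platform log-ratios, taking
    sample `ref` as the common reference (illustrative definition)."""
    a = array_log - array_log[ref]
    p = pcr_log - pcr_log[ref]
    return float(np.abs(a - p).mean())

# Hypothetical log2 expression of one gene across six samples
pcr = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # RT-PCR reference values
faithful = pcr.copy()        # microarray with a linear response
compressed = 0.5 * pcr       # compression artifact: log-ratios shrunk 2x

# PCC cannot distinguish the two responses; the LRD-style measure can
```

Because the compressed response is still perfectly linear in the RT-PCR values, any correlation-based metric scores it as flawless, while a log-ratio comparison exposes the systematic underestimation of fold changes.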
Naturally, our choice of data sets from colon biopsies and cancer cell lines reflects our research interests. These data sets, like many recent large-scale data sets, were generated using the HG-U133A or HG-U133 Plus 2.0 microarrays, which use a 3′-biased amplification protocol and a transcript-based probe set design. In contrast, the newer generation of microarrays from Affymetrix uses a whole-transcript amplification protocol and an exon-based probe design, which may offer a more specific portrait of expressed sequences. Preprocessing algorithms that perform well on the U133 arrays should also perform well on the newer arrays, but it will be important to confirm this as researchers adopt the newer platforms.
Several other studies have compared preprocessing algorithms using different data sets and/or different metrics. Cope et al. described a series of tests to evaluate preprocessing algorithms using the spike-in data set from Affymetrix and the dilution data set from Genelogic. They generally observed better performance from RMA and MAS5 than from dChip (MBEI), which is consistent with our findings. Choe et al. generated a spike-in data set with a defined background, in contrast to the Affymetrix spike-in data sets, which use HeLa RNA as background. Instead of comparing all-in-one preprocessing algorithms as we did, they evaluated the relative merits of the individual background correction, normalization, and summarization steps. Dallas et al. compared Affymetrix arrays with RT-PCR measurements of 48 genes in various human samples using the PCC as the performance metric. They evaluated only MAS5 and RMA and found that the two algorithms performed comparably. Qin et al. compared microarrays to RT-PCR using mouse heart tissue and the Pearson correlation of fold changes as a performance metric, and found that MAS5, dChip-with-mismatch, and GCRMA outperformed dChip-without-mismatch (analogous to our “MBEI”), RMA, and VSN. Barash et al. used the variability of redundant measurements to estimate noise and found that RMA outperformed dChip and MAS5 without decreasing the number of differentially expressed genes. There is some disagreement between these results and ours, which might stem from differences in data sets and methodology.
It is our impression that the most commonly used algorithms are MAS5, MBEI, and RMA. This may be due partly to their relatively early availability (2001–2003) and partly to their packaging within relatively user-friendly software (MAS5 and MBEI) or their fast computation in R (RMA). It is therefore reassuring that these three algorithms performed well in our tests, although MBEI was slightly weaker. We were somewhat surprised by the performance of the MAS5 algorithm, which has not always fared well in earlier comparisons. Notably, MAS5 is the only algorithm in our comparison that operates on a single array at a time, a feature that is advantageous for diagnostic use but may represent a handicap relative to the other algorithms, which theoretically gain accuracy by considering an entire series of arrays in a single statistical model. The good performance of the MAS5 algorithm has been noted by others.
Overall, if we had to choose a single preprocessing algorithm, we would choose PLIER+16. However, several other algorithms performed in a comparable manner. Ultimately, it is likely that any of the top algorithms would be suitable for most purposes.