The analysis of gene expression data generated by microarrays, such as the high-density oligonucleotide microarays produced by Affymetrix (Santa Clara, CA), is an often laborious process in which a basic understanding of molecular biology, computer science and statistics is required. In a typical microarray experiment, RNA obtained under various conditions (patients, treatments, disease states etc.) is hybridised to microarrays. By tagging the RNA with a fluorescent marker, intensity values can be obtained that correspond to the amount of labeled RNA bound to the array. On the widely used Affymetrix platform, gene expression is measured using probe sets consisting of 11 to 20 perfect match (PM) probes of 25 nucleotides, which are complementary to a target sequence, and a similar number of mismatch (MM) probes in which the 13th nucleotide has been changed. The MM probe measurements are thought to comprise most of the background cross-hybridization and stray signal affecting the PM probes.
Normalization of probe intensity values is performed to remove any non-biological variation. The individual probe measurements are then summarized as probe set expression levels, as estimates of the amount of specific mRNA present in the biological sample. Normalization and probe set summarization are statistical procedures for which several methods have been developed. MicroArray Suite (MAS 5.0), a software package provided by Affymetrix, normalizes intensities using a global scaling procedure and measures expression using a one-step Tukey biweight algorithm, which is defined as the anti-log of a robust average of differences between log(PM) and log(MM) [1
]. The same algorithms are implemented in the software package currently provided by Affymetrix, GCOS. One of the first alternatives to this approach was provided by Li and Wong with the dChip-method, which scales the intensity data towards the median intensity in a group of arrays and then uses model-based index estimates, giving variable weight to PM-MM probe pairs of a probe set based on variance between arrays, to measure expression [2
]. Irizarry et al. introduced RMA (robust multi-array average), later followed by GCRMA (GC robust multi-array average). RMA, often preceded by quantile normalization [3
], applies a median polish procedure to PM intensities only in summarization. GCRMA is based on a similar model as RMA but takes into account the effect of stronger bonding of G/C pairs [5
]. An overview of these methods is shown in Table . Other normalization methods, such as the variance stabilizing normalization (VSN, [7
]) and summarization methods, such as PLIER [8
], have been developed, but are less frequently applied.
Overview of several pre-processing methods.
Various studies have been published which assess the differences in outcome of these different data pre-processing methods [9
]. To validate and test pre-processing methods, two publicly available datasets are commonly used. The Latin square dataset provided by Affymetrix [15
] contains spiked-in cRNA's at several concentrations facilitating the assessment of the relation between mRNA concentration and expression value. The GeneLogic dilution series (obtainable on request, [16
]) gives an estimate of the relation between actual and measured differential expression. Based on these datasets, an online benchmark tool has been developed to encourage authors to test their method [17
]. This tool assesses quality of pre-processing using several parameters in five different groups: (1) variability of expression across replicate arrays, (2) response of expression measure to changes in abundance of RNA, (3) sensitivity of fold-change measures to the amount of actual RNA sample, (4) accuracy of fold-change as a measure of relative expression and (5) usefulness of raw fold-change score for the detection of differential expression. Pronounced differences between different procedures have been shown to occur [4
The studies on the Latin square and dilution data were performed using data generated specifically for this purpose, allowing comparisons of specific analyses, showing accurately which methods perform best. In effect, statistical properties of the various estimators are tested. Several authors noted that the use of two special-purpose datasets for calibration of statistical procedures creates a risk on overfitting of the available data and therefore focused on using experimental data to compare methods with respect to the sets of differentially expressed genes found [9
]. This sometimes lead to contradictory results, where for instance a study using the Latin square dataset showed the MAS5.0 method to outperform the dChip method on detecting differentially expressed genes [13
], while a study on experimental data showed the opposite [9
]. Therefore, more work is needed to reliably establish how important the effect of choice of pre-processing method is in every-day practice, especially when analyses such as clustering and classification are applied.
In this paper, we focus on one practical application of microarrays: patient-cohort studies [18
]. In such studies, researchers typically select sets of genes that are differentially expressed between certain known conditions, a supervised analysis. Moreover, unsupervised techniques (not imposing any prior knowledge on the data) such as clustering are applied to detect biological relations between samples or genes by grouping them according to their expression profiles. Often the goal is to obtain a predictor (classifier) for, for instance, prognostically relevant categories, using supervised analysis.
Given that different pre-processing procedures will influence the outcome of these analyses, several questions can be asked, such as: How well is expression measured using a number of different pre-processing methods? What is their effect on the detection of differentially expressed genes, clusters found and classification results? By focusing on practical applications of microarray studies, we hope to give insight into the relevance of different pre-processing procedures to biology-oriented researchers.