The first version of Affymetrix’s analysis software (

3) used an average over probe pairs of the differences

*PM*_{ij} –

*MM*_{ij},

*j* = 1,…,

*J*, for each array

*i*. A robust average was used to protect against outlier probes. Summary statistics, such as this average difference (AD), are motivated by underlying statistical models. A model for AD is

*PM*_{ij} –

*MM*_{ij} = θ

_{i} + ε

_{ij},

*j* = 1,…,

*J*. The expression quantity on array

*i* is represented with the parameter θ

_{i}. AD is an appropriate estimate of θ

_{i} if the error term ε

_{ij} has equal variance for

*j* = 1,…,

*J*. However, the equal variance assumption does not hold for GeneChip probe level data, since probes with larger mean intensities have larger variances (

4). In the latest version of their software (

5), Affymetrix uses a log transformation that is successful at reducing the dependence of the variance on the mean. Specifically, the MAS 5.0 signal is defined as the anti-log of a robust average (Tukey biweight) of the values log(

*PM*_{ij} –

*CT*_{ij}),

*j* = 1,…,

*J*. To avoid taking the log of negative numbers,

*CT* is defined as a quantity equal to

*MM* when

*MM* <

*PM*, but adjusted to be less than

*PM* when

*MM* ≥

*PM*, which in general occurs for about one-third of all probes (

4,

6). A model for MAS 5.0 is log(

*PM*_{ij} –

*CT*_{ij}) = log(θ

_{i}) + ε

_{ij},

*j* = 1,…,

*J*.

A recent paper (

7) reported that variation of a specific probe across multiple arrays could be considerably smaller than the variance across probes within a probe set. In the log

_{2} scale, the between-array standard deviation (SD) is in general five times smaller than the within-probe set SD (

4,

7). To account for this strong probe affinity effect, a multiplicative model,

*PM*_{ij} –

*MM*_{ij} = θ

_{i}_{j} + ε

_{ij},

*i* = 1,…,

*I*,

*j* = 1,…,

*J*, was proposed (

7). The probe affinity effect is represented by

_{j}. For analyses where multiple arrays are available a model-based expression index is defined as the maximum likelihood estimate (under the assumption that the errors follow a normal distribution) of the expression parameters θ

_{i}. This estimate will depend on the probe affinity effects

_{j}, which we can estimate if we have enough arrays. The software package dChip (

http://www.biostat.harvard.edu/complab/dchip/) can be used to fit this model and obtain what we refer to as the dChip expression measure. Outlier probe intensities are removed as part of the estimation procedure (

7).

Using data from a spike-in experiment (described in more detail below) we found that appropriately removing background and normalizing probe level data across arrays results in an improved expression measure motivated by a log scale linear additive model. The model can be written as

*T*(

*PM*_{ij}) =

*e*_{i} +

*a*_{j} + ε

_{ij},

*i* = 1,…,

*I*,

*j* = 1,…,

*J*, where

*T* represents the transformation that background corrects, normalizes, and logs the

*PM* intensities,

*e*_{i} represents the log

_{2} scale expression value found on arrays

*i* = 1,…,

*I*,

*a*_{j} represents the log scale affinity effects for probes

*j* = 1,…,

*J*, and ε

_{ij} represents error as above. Notice that this is an additive model for the log transform of (background corrected, normalized)

*PM* intensities. It is quite different from the additive model in

*PM* –

*MM* that was found unsatisfactory in Li and Wong (

7), most likely because of the very strong mean variance dependence that would be present in such an additive model. A robust linear fitting procedure, such as median polish (

8), was used to estimate the log scale expression values

*e*_{i}. The resulting summary statistic is referred to as RMA. The normalization and background correction procedures used are reported elsewhere (

4,

9). Recent results (

4,

10) suggest that subtracting

*MM* as a way of correcting for non-specific binding is not always appropriate. It is possible that information about non-specific binding is contained in the

*MM* values, but empirical results demonstrate that mathematical subtraction does not translate to biological subtraction. We have found that, until a better solution is proposed, simply ignoring these values is preferable.

There is no gold standard to compare and test summaries of probe level data. For this reason, data from spike-in experiments have been used to assess the technology and to motivate normalization procedures (

1,

11,

12). In a recent paper (

13) a dilution/mixture experiment was used to compare existing expression measures. In a similar way, we used data from spike-in and dilution experiments to conduct various assessments on the MAS 5.0, dChip and RMA expression measures. Specifically, we compare the measures of expression according to three criteria: (i) the precision of the measures of expression, as estimated by standard deviations across replicate chips; (ii) the consistency of fold change estimates based on widely differing concentrations of target mRNA hybridized to the chip; (iii) the specificity and sensitivity of the measures’ ability to detect differential expression, presented in terms of receiver operating characteristic (ROC) curves.

For the dilution study (

http://qolotus02.genelogic.com/datasets.nsf/), two sources of cRNA, human liver tissue and a central nervous system cell line (CNS), were hybridized to human arrays (HG-U95A) in a range of dilutions and proportions (

4). We studied data from six groups of arrays that had hybridized liver and CNS cRNA at concentrations of 1.25, 2.5, 5.0, 7.5, 10.0 and 20.0 µg total cRNA. Five replicate arrays were available for each generated cRNA (

*n* = 60 total).

For the spike-in studies, different cRNA fragments were added to the hybridization mixture of the arrays at different pM concentrations. The cRNAs were spiked-in at a different concentration on each array (apart from replicates) arranged in a cyclic Latin square design with each concentration appearing once in each row and column. All arrays had a common background cRNA. We used data from two different studies, one from Affymetrix (

http://www.affymetrix.com/analysis/download_center2.affx) where 14 human genes were spiked-in at concentrations ranging from 0 to 1024 pM and one from GeneLogic (

http://qolotus02.genelogic.com/datasets.nsf/) where 11 control cRNA fragments were spiked-in at concentrations ranging from 0 to 100 pM.

The GeneLogic spike-in experiment consists of a number of arrays each hybridized to samples with suitable concentrations of 11 different cRNA fragments added to a hybridization mixture consisting of cRNA from the same AML tissue. The 11 control cRNAs were BioB-5, BioB-M, BioB-3, BioC-5, BioC-3 and BioDn-5 (all

*Escherichia coli*), CreX-5 and CreX-3 (phage P1), and DapX-5, DapX-M and DapX-3 (a

*Bacillus subtilis* gene) (

11,

14,

15). The cRNA were chosen to match the target sequence for each of the Affymetrix control probe sets. For example, for DapX (a

*B.subtilis* gene), the 5′, middle and 3′ target sequences (identified by DapX-5, DapX-M and DapX-3) were each synthesized separately and spiked-in at a specific concentration. Thus, for example, DapX-3 target sequence may be added to the total hybridization solution of 200 µl to give a final concentration of 0.5 pM. The 11 control cRNAs were spiked-in at a different concentration on each array (apart from replicates). The 12 concentrations used were 0.5, 1, 1.5, 2, 3, 5, 12.5, 25, 37.5, 50, 75 and 100 pM, and these were arranged in a 12 × 12 cyclic Latin square, with each concentration appearing once in each row and column. The 12 combinations of concentrations used on the arrays were taken from the first 11 entries of the 12 rows of this Latin square. Three replicated hybridizations were carried out for each combination of concentrations of the spiked-in material.

The Affymetrix spike-in experiment was done in a similar fashion. It consists of a series of human genes spiked-in at known concentrations. They represent a subset of the data used to develop and validate the MAS 5.0 algorithm. The Latin square consists of 14 spiked-in gene groups in 14 array groups. The concentration of the 14 groups in the first array group are 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM. Each subsequent array group rotates the spike-in concentrations by one group, i.e. array group 2 begins with 0.25 pM and ends at 0 pM, on up to array group 14, which begins with 1024 pM and ends with 512 pM. There were three replicates for each concentration combination, except for two combinations for which 12 replicates were formed.

The results presented in the figures and tables were ob tained using the R environment (

16), which can be freely downloaded from

http://www.r-project.org/. All the data (

*cel* files) containing the probe level intensities are available on the World Wide Web as stated above. To obtain the MAS 5.0 expression measures these files were processed with MAS 5.0 software. The software package dChip (

http://www.biostat.harvard.edu/complab/dchip/) was used to obtain the dChip measures. The default PM-only model version was used. The RMA measures were computed using the Methods for Affymetrix Oligonucleotide Arrays R package (

17), which is freely available on the World Wide Web (

http://www.bioconductor.org).