Statistical models for background-subtracted raw intensity data
Assume that the experimental design for two cDNA samples on the array are a control and a treatment sample, for example, the control sample is assigned to the green dye and the treated sample is assigned to the red dye. Because of different labeling efficiencies or different scanning sensitivities to the two dyes, the so-called dye swap design with two arrays is often used to account for dye biases. In the dye swap design, on array 1, the control sample is assigned to the green dye and the treated sample is assigned to the red dye; the dye assignments are reversed on array 2. We first present a model for gene expression data from this type of two-channel cDNA microarray design. We will consider the data from another type of two-channel cDNA microarray design and from a one-channel microarray.
It has been recognized that microarray spot intensity, in general, is approximately log-normally distributed with the standard deviation approximately proportional to the magnitude of intensity (mean), e.g. Black and Doerge (
12), Chen
et al. (
13) and Ideker
et al. (
14). Furthermore, two intensity measurements of the same spot are correlated. A model for background-subtracted intensities (without a normalization) for a spot (gene) on the array is
Xijc = µiceηijc + εijc
Xijt = µiteηijt + εijt,
where (µic,µit) represents the paired true expression levels at the spot i for control and treated samples. In this model, (ηijc,ηijt) represents the multiplicative error and (εijc,εijt) represents the additive error for spot i and arrays j, i = 1,…, g and j = 1,…, r. For each gene i, we assume that the two error components are independently and identically bivariate- normally distributed,
(ηijc,ηijt) i.i.d.~ N(0,Φi)
(εijc,εijt) i.i.d.~ N(0,Σi),
where Φi and Σi are variance–covariance matrices of (ηijc,ηijt) and (εijc,εijt), respectively, and
Also, the errors (η
ijc,η
ijt) and (ε
ijc,ε
ijt) are independent of one another. This model is similar to that proposed by Rocke and Durbin (
15). The mean, variance and covariance for (
Xijc,
Xijt) are
This model has an approximately constant coefficient of variation. That is, the standard deviation is approximately proportional to its mean expression level. We refer to this model as Model I.
Frequently, the background-subtracted intensities may have different scales among replicated arrays due to different total amounts of labeled cDNA sample or different sensitivities in scanner setting. In other words, the variation among the genes on the same array may behave more alike. A simple approach to modeling array effects is to model multiplicative error as array-specific effects
(η
ijc,η
ijt)
![[equivalent]](/corehtml/pmc/pmcents/equiv.gif)
(η
jc,η
jt)
i.i.d.~ N(0,Φ
i).
Under this model, the covariance between the spots
i1 and
i2 on the same array
j is,
Cov(
Xi1jk,
Xi2jk) =
µi1k,
µi2k ·

· (

– 1),
k =
c,
t. We will refer to this model as Model II.
In practice, the background-subtracted intensity data are usually log-transformed to improve the normality and to stabilize the variance before statistical analysis. Applying the logarithmic transformation Yijk = log(Xijk) and using the Taylor’s expansion at µ*ik = E(Xijk), the mean, variance and covariance are approximately:
Since µ
ik is generally much larger than σ
ik, the log-transformed intensity
Yijk will be approximately normally distributed with mean
log(µ
ik) and variance
2ik. This supports using the parametric approach, such as
t-test or
F-test, to the log-transformed data for identifying differentially expressed genes. In the evaluation and analysis below, the data generated from Models I and II are assumed to be log-transformed (in base 2).
Statistical models for log-transformed data
As discussed, there are a number of nuisance factors that can influence the intensity measurements. Typically, a normalization method, such as the median, ANOVA or M versus A plot lowess normalization, is applied to the log-transformed intensity prior to statistical analysis. Let Yijc and Yijt be the background-subtracted and normalized intensity for control and treated samples, respectively. We propose the linear model with two sources of variation,
Yijc = µic + ηijc + εijc
Yijt = µit + ηijt + εijt
Analogous to Model I and Model II, the (ηijc,ηijt) and (εijc,εijt) are assumed to be independent of one another, and both are independently and identically bivariate-normally distributed. The distributions of Yijc and Yijt are
Yijc i.i.d.~
N(µ
ic,
2ic + σ
2ic) and
Yijt i.i.d.~
N(µ
it,
2it + σ
2it).
The covariance between
Yijc and
Yijt is (τ
i
ic
it + ρ
iσ
icσ
it). The distribution of difference
Tij =
Yijc –
Yijt is normal with mean (µ
ic – µ
it) and variance σ
2i = (
2ic – 2τ
i
ic
it +
2it) + (σ
2ic – 2ρ
iσ
icσ
it + σ
2it). This model assumes that responses among the spots on the same array are independent. We will refer to this model as Model III.
Similarly, the variation among genes on the same array can be modeled as array-specific effects, that is, η
ijc = η
jc and η
ijt = η
jt. The variance and covariance are
Var(
Yijk) =
2k + σ
2ik and
Cov(
Yi1jk,
Yi2jk) =
2k,
k =
c,
t. This is known as the liner mixed-effects model. The distribution of difference
Tij =
Yijc –
Yijt is also normal with mean (µ
ic – µ
it) and variance σ
2i = (
2c – 2τ
c
t +
2t) + (σ
2ic – 2ρ
iσ
icσ
it + σ
2it). We will refer to this model as Model IV.
The models described above are for data from a two-channel microarray experiment in which the control and treatment samples are hybridized on the same array. These models can be applied to the data either from a one-channel experiment or from a two-channel experiment with reference design (
2). In the reference design, all samples of interest (control and treatments) are hybridized on different arrays labeled with the same color dye, while a reference sample labeled with the other color dye is used on every array to hybridize with either a control or a treatment sample. In this design, the relative expression levels of the control-to-reference or treatment-to-reference can be directly computed as observed responses for each array. Thus, like the one-channel, the array consists of one measurement (assuming no replicate spots within an array) for each gene. In either case, the expression data are an independent sample, and the correlations τ
i and ρ
i are set to be 0.
Test statistics
Identifying differentially expressed genes between the control and treatment can be formulated in terms of the hypothesis
Hi0: µic – µit = 0 versus Hi1: µic – µit ≠ 0.
The sampling distribution
ic –
it is used to test the hypothesis
Hi0, where
ic and
it are the means of the
r (array) replicates in the control group and
r replicates in the treatment group, respectively, and
s2ic and
s2it are the corresponding sample variances.
For independent control and treatment samples (assuming τ
i = ρ
i = 0), the hypothesis is commonly done by computing the two-sample
t-statistic (
ic –
it)/
si,2, where
si,2 is the standard error estimate of (
ic –
it),
i = 1,…,
g. Under the model of an equal variance
Var(
Yijc) =
Var(
Yijt), if there is no difference between the two groups, then
ic –
it has a
t-distribution with 2
r–2 degrees of freedom, where s
2i,2 = (2/
r)
s2i and
s2i = (
r – 1) (
s2ic +
s2it)/(2
r – 2) is the common variance estimate. If the two groups have difference variances, then the two-sample unequal variance
t-statistic or Welch test is applied (
9,
10). In this paper, evaluation of the two-sample
t-test is based on the model of an equal variance in the two groups.
As discussed, the intensities measured from the same spot are correlated. In such paired control and treatment data, we can apply the one-sample
t-test for the two-group comparison. Let
Dij =
Yijc –
Yijt and
i be mean of the
Dij over the
r replicate arrays. If there is no difference between the two groups, then the one-sample
t-statistic
ti =
i/
si,1 has a
t-distribution with
r–1 degrees of freedom, where
s2i,1 =
s2i/
r and
s2i is the sample variance of
Dij over the
r replicates.
The t-test has the highest power to detect a difference if the samples are normally distributed. If the two groups have difference variances, then the two-sample unequal variance t-statistic should be applied. In this case, the use of an equal variance two-sample t-test may be biased. In practice, the distribution of the normalized intensity data may not follow a normal distribution, the permutation tests are generally recommended. The permutation test does not require any distribution assumption. We consider one-sample and two-sample permutation tests using t-statistics.
The model for the t-tests presented in this paper performs a gene-by-gene analysis. It computes the sample variance (standard deviation) for each gene in the analysis. Thus, this approach does not require a constant variance or a constant coefficient of variation across genes.
Example data set
The example is a cDNA two-channel experiment from a toxicogenomic study of gene expression levels of kidney samples from rats dosed with a drug. The experiment includes six replicate arrays (arrays A1–A6) from a 700-gene rat Phase-1 chip (Molecular Toxicology, Santa Fe, NM). In each array there are four by four grids of 14 × 14 spots. Grids 9–12 are replicates of grids 1–4, and grids 13–16 are replicates of grids 5–8. On the arrays A1–A3, the control samples were assigned to the red dye and treated samples were assigned to the green dye. The dye assignments to the control and treated samples were reversed on the arrays A4–A6. In addition, sequences of five genes from other species different from the one of 700 genes were also spotted on the array to monitor non-specific background binding of labeled RNA. Chen
et al. (
16) described several normalization methods for this data set. Let
yijk denote the base-2 logarithm of the intensity for the
i-th gene on the array
j in the
t-th treatment and
k-th dye,
i = 1,…,
g,
j = 1,…,
r,
k = 1,2 and
t = 1,2. For a given array, let
s denote the number of disjoint subsets (partitions) in the array, which is based on the spotting pattern matrix generated by a single pin with the size 14 × 14. Denote
Ll(j) as the
l-th subset (location) on the array
j,
l = 1,…,
s. Chen
et al. (
16) proposed the subset normalization model
yijk,l = m +Gi + Ll(j) + Ij + Dk + (AD)jk + eijk,l,
where
m is the overall average signal,
Gi represents the effect of the
i-th gene,
Ll(j) represents the effect of the location
l on the
j-th array,
Ij represents the effect of the intensity on the array
j,
Dk represents the effect of the
k-th dye and (
AD)
jk accounts for the effect of array
j and dye
k. This model is a generalization of the Kerr’s global ANOVA model (
17,
18); the array effects
Aj are decomposed into location and intensity components,
Ll(j) +
Ij. In this paper, the
Ll(j) is estimated by the median,
Ij is estimated using the lowess fit and the other effects are estimated using the least-squares estimates. The residuals (normalized intensities) removing the overall effects from the fitted model correspond to the treatment × gene interactions as the effect of interest.
Figure A and B are scatterplots of the mean versus standard deviation of the un-normalized intensities for the control and treatment among the 705 genes, and the fitted lowess regression curves for the control and treatment samples. Figure A shows an approximately linear relationship between the mean and standard deviation before the log-transformation. Applying the log-transformation to stabilize variation, Figure B shows no apparent relationship between the mean and standard deviation. Figure A is a scatterplot of the mean versus standard deviation of the normalized intensities for the two groups. There is no apparent relationship between the mean and standard deviation. Figure B is a scatterplot of the standard deviations between the control versus the treatment. The standard deviations between the two groups for the 705 genes mostly appear to be similar. We also evaluated the correlation between two intensities on the same spot. The mean correlations for the un-normalized data and normalized data are 0.6 and 0.8, respectively; a significant correlation between two samples from the same spot. A normal probability plot is used to assess the normality assumption for this data set. A normal probability plot displays the ordered values of the data set versus the corresponding quantiles of a standard normal distribution. A linear plot would imply that the data are reasonably normal. Figure A and B are the normal probability plots for the control and treatment, respectively. It can be seen that the normalized intensity data appear to be heavy-tailed; this suggests that the data are more similar to a t-distribution than to a normal distribution. Using a permutation test to identify differentially expressed genes should be more appropriate.