|Home | About | Journals | Submit | Contact Us | Français|
This paper compares the type I error and power of the one- and two-sample t-tests, and the one- and two-sample permutation tests for detecting differences in gene expression between two microarray samples with replicates using Monte Carlo simulations. When data are generated from a normal distribution, type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as the one-channel array or two-channel array experiment using reference design, the two-sample t-tests appear more powerful than the one-sample t-tests.
Recent advances in cDNA microarray technology provide exciting tools for studying the expression levels of thousands of distinct genes simultaneously. There are two main platforms for cDNA microarray: nylon membrane-based filter arrays and chemically coated glass-based arrays. The nylon membrane arrays are hybridized with 33P or 35S-labeled cDNA targets, and glass arrays are hybridized with fluorescent dye-labeled targets. The nylon array is also used with the colorimetric detection (1). The simplest microarray experiment is to study changes in gene expression levels between a reference sample and a treated (toxin or drug) sample. In the experiment, samples of DNA clones with known sequence content are spotted and immobilized onto a glass slide or nylon filter. The mRNA extracted from the tissue cell under study is purified, reversed-transcribed into cDNA and labeled with radioactive markers or with green or red fluorescent dyes. Labeled cDNA hybridizes to the spots containing complementary sequences on the array. After hybridization, the radioactive or fluorescent signal intensities are measured using a phosphoimager or laser scanner, respectively. One intensity is measured on each spot for the radiation-labeled array (one-channel array) while two intensities are measured on each spot for the fluorescence dye-labeled array (two-channel array). In both cases, the intensities are surrogates for the expression levels of genes in the sample under study.
In a two-channel experiment, the same spot is used to assess the expression of a gene for the control sample and the treated sample labeled with red and green (or vice versa), respectively. The expression levels of the two samples can be compared for each gene in an array. The ratio of the fluor intensity for each spot measures the relative abundance of the corresponding gene under two different experimental conditions. In contrast, in a one-channel experiment, the expression levels of a gene for the control and treated samples are measured on two different arrays. The expression levels of the control and treatment arrays are compared to assess the difference between two samples.
It has been recognized that there are many sources of systematic variation, spatial heterogeneity and signal saturation, in assigning expression levels to the measured intensities. The expression levels from the two measurements are not directly comparable. Adjustments of the expression data should be performed prior to statistical analysis. Yang et al. (2) and Irizarry et al. (3) described several normalization methods for the two-channel and one-channel arrays, respectively.
A goal of the microarray analysis is to identify a subset of genes that are differentially expressed between the control and treated samples. Draghici (4) gave a review and comparison of currently proposed methods for detecting the set of differentially expressed genes. In general, a gene is said to be differentially expressed if the ratio in absolute value of the expression levels between the treated group to the control exceeds a certain threshold (5), e.g. 2- or 4-fold change. These genes are classified as altered genes. This approach is deficient in some respects. For example, the ratio at the lower levels can be more different than that at the higher levels. Furthermore, gene expression measurements in hybridization experiments are noisy; e.g. the coefficient of variation in gene expression in mouse liver was found to be >30% among 80% of genes tested and >50% of the 56% of genes (6). Alternatively, Newton et al. (7) proposed an improved method of inferring fold changes by deriving the posterior odds of change within a similar model. Furthermore, the statistical significance testing approach and ANOVA can also be applied to identify differentially expressed genes (8–10).
Standard statistical methods have been used for comparisons of intensity levels among treatments one gene at a time (11). The log-transformed normalized intensities from two groups can be compared using either the one-sample or two-sample t-test (8,9). These two approaches represent different underlying model assumptions. The two-sample t-test assumes the distribution of the log-transformed intensity data in each group is independently and identically normally distributed, while the one-sample t-test assumes that the paired distribution of treated and control groups is normally distributed. In this paper we compare these two different testing approaches and compare them to permutation tests for identifying differentially expressed genes with replicate arrays. All models and methods described here are not restricted to any specific microarray platform or technology.
Assume that the experimental design for two cDNA samples on the array are a control and a treatment sample, for example, the control sample is assigned to the green dye and the treated sample is assigned to the red dye. Because of different labeling efficiencies or different scanning sensitivities to the two dyes, the so-called dye swap design with two arrays is often used to account for dye biases. In the dye swap design, on array 1, the control sample is assigned to the green dye and the treated sample is assigned to the red dye; the dye assignments are reversed on array 2. We first present a model for gene expression data from this type of two-channel cDNA microarray design. We will consider the data from another type of two-channel cDNA microarray design and from a one-channel microarray.
It has been recognized that microarray spot intensity, in general, is approximately log-normally distributed with the standard deviation approximately proportional to the magnitude of intensity (mean), e.g. Black and Doerge (12), Chen et al. (13) and Ideker et al. (14). Furthermore, two intensity measurements of the same spot are correlated. A model for background-subtracted intensities (without a normalization) for a spot (gene) on the array is
Xijc = µiceηijc + εijc
Xijt = µiteηijt + εijt,
where (µic,µit) represents the paired true expression levels at the spot i for control and treated samples. In this model, (ηijc,ηijt) represents the multiplicative error and (εijc,εijt) represents the additive error for spot i and arrays j, i = 1,…, g and j = 1,…, r. For each gene i, we assume that the two error components are independently and identically bivariate- normally distributed,
(ηijc,ηijt) i.i.d.~ N(0,Φi)
(εijc,εijt) i.i.d.~ N(0,Σi),
where Φi and Σi are variance–covariance matrices of (ηijc,ηijt) and (εijc,εijt), respectively, and
Also, the errors (ηijc,ηijt) and (εijc,εijt) are independent of one another. This model is similar to that proposed by Rocke and Durbin (15). The mean, variance and covariance for (Xijc, Xijt) are
This model has an approximately constant coefficient of variation. That is, the standard deviation is approximately proportional to its mean expression level. We refer to this model as Model I.
Frequently, the background-subtracted intensities may have different scales among replicated arrays due to different total amounts of labeled cDNA sample or different sensitivities in scanner setting. In other words, the variation among the genes on the same array may behave more alike. A simple approach to modeling array effects is to model multiplicative error as array-specific effects
(ηijc,ηijt) (ηjc,ηjt) i.i.d.~ N(0,Φi).
Under this model, the covariance between the spots i1 and i2 on the same array j is, Cov(Xi1jk,Xi2jk) = µi1k,µi2k ·
– 1), k = c,t. We will refer to this model as Model II.
In practice, the background-subtracted intensity data are usually log-transformed to improve the normality and to stabilize the variance before statistical analysis. Applying the logarithmic transformation Yijk = log(Xijk) and using the Taylor’s expansion at µ*ik = E(Xijk), the mean, variance and covariance are approximately:
Since µik is generally much larger than σik, the log-transformed intensity Yijk will be approximately normally distributed with mean log(µik) and variance 2ik. This supports using the parametric approach, such as t-test or F-test, to the log-transformed data for identifying differentially expressed genes. In the evaluation and analysis below, the data generated from Models I and II are assumed to be log-transformed (in base 2).
As discussed, there are a number of nuisance factors that can influence the intensity measurements. Typically, a normalization method, such as the median, ANOVA or M versus A plot lowess normalization, is applied to the log-transformed intensity prior to statistical analysis. Let Yijc and Yijt be the background-subtracted and normalized intensity for control and treated samples, respectively. We propose the linear model with two sources of variation,
Yijc = µic + ηijc + εijc
Yijt = µit + ηijt + εijt
Analogous to Model I and Model II, the (ηijc,ηijt) and (εijc,εijt) are assumed to be independent of one another, and both are independently and identically bivariate-normally distributed. The distributions of Yijc and Yijt are
Yijc i.i.d.~ N(µic,2ic + σ2ic) and Yijt i.i.d.~ N(µit,2it + σ2it).
The covariance between Yijc and Yijt is (τiicit + ρiσicσit). The distribution of difference Tij = Yijc – Yijt is normal with mean (µic – µit) and variance σ2i = (2ic – 2τiicit + 2it) + (σ2ic – 2ρiσicσit + σ2it). This model assumes that responses among the spots on the same array are independent. We will refer to this model as Model III.
Similarly, the variation among genes on the same array can be modeled as array-specific effects, that is, ηijc = ηjc and ηijt = ηjt. The variance and covariance are Var(Yijk) = 2k + σ2ik and Cov(Yi1jk,Yi2jk) = 2k, k = c,t. This is known as the liner mixed-effects model. The distribution of difference Tij = Yijc – Yijt is also normal with mean (µic – µit) and variance σ2i = (2c – 2τct + 2t) + (σ2ic – 2ρiσicσit + σ2it). We will refer to this model as Model IV.
The models described above are for data from a two-channel microarray experiment in which the control and treatment samples are hybridized on the same array. These models can be applied to the data either from a one-channel experiment or from a two-channel experiment with reference design (2). In the reference design, all samples of interest (control and treatments) are hybridized on different arrays labeled with the same color dye, while a reference sample labeled with the other color dye is used on every array to hybridize with either a control or a treatment sample. In this design, the relative expression levels of the control-to-reference or treatment-to-reference can be directly computed as observed responses for each array. Thus, like the one-channel, the array consists of one measurement (assuming no replicate spots within an array) for each gene. In either case, the expression data are an independent sample, and the correlations τi and ρi are set to be 0.
Identifying differentially expressed genes between the control and treatment can be formulated in terms of the hypothesis
Hi0: µic – µit = 0 versus Hi1: µic – µit ≠ 0.
The sampling distribution ic – it is used to test the hypothesis Hi0, where ic and it are the means of the r (array) replicates in the control group and r replicates in the treatment group, respectively, and s2ic and s2it are the corresponding sample variances.
For independent control and treatment samples (assuming τi = ρi = 0), the hypothesis is commonly done by computing the two-sample t-statistic (ic – it)/si,2, where si,2 is the standard error estimate of (ic – it), i = 1,…, g. Under the model of an equal variance Var(Yijc) = Var(Yijt), if there is no difference between the two groups, then ic – it has a t-distribution with 2r–2 degrees of freedom, where s2i,2 = (2/r)s2i and s2i = (r – 1) (s2ic + s2it)/(2r – 2) is the common variance estimate. If the two groups have difference variances, then the two-sample unequal variance t-statistic or Welch test is applied (9,10). In this paper, evaluation of the two-sample t-test is based on the model of an equal variance in the two groups.
As discussed, the intensities measured from the same spot are correlated. In such paired control and treatment data, we can apply the one-sample t-test for the two-group comparison. Let Dij = Yijc – Yijt and D̄i be mean of the Dij over the r replicate arrays. If there is no difference between the two groups, then the one-sample t-statistic ti = D̄i/si,1 has a t-distribution with r–1 degrees of freedom, where s2i,1 = s2i/r and s2i is the sample variance of Dij over the r replicates.
The t-test has the highest power to detect a difference if the samples are normally distributed. If the two groups have difference variances, then the two-sample unequal variance t-statistic should be applied. In this case, the use of an equal variance two-sample t-test may be biased. In practice, the distribution of the normalized intensity data may not follow a normal distribution, the permutation tests are generally recommended. The permutation test does not require any distribution assumption. We consider one-sample and two-sample permutation tests using t-statistics.
The model for the t-tests presented in this paper performs a gene-by-gene analysis. It computes the sample variance (standard deviation) for each gene in the analysis. Thus, this approach does not require a constant variance or a constant coefficient of variation across genes.
The example is a cDNA two-channel experiment from a toxicogenomic study of gene expression levels of kidney samples from rats dosed with a drug. The experiment includes six replicate arrays (arrays A1–A6) from a 700-gene rat Phase-1 chip (Molecular Toxicology, Santa Fe, NM). In each array there are four by four grids of 14 × 14 spots. Grids 9–12 are replicates of grids 1–4, and grids 13–16 are replicates of grids 5–8. On the arrays A1–A3, the control samples were assigned to the red dye and treated samples were assigned to the green dye. The dye assignments to the control and treated samples were reversed on the arrays A4–A6. In addition, sequences of five genes from other species different from the one of 700 genes were also spotted on the array to monitor non-specific background binding of labeled RNA. Chen et al. (16) described several normalization methods for this data set. Let yijk denote the base-2 logarithm of the intensity for the i-th gene on the array j in the t-th treatment and k-th dye, i = 1,…, g, j = 1,…, r, k = 1,2 and t = 1,2. For a given array, let s denote the number of disjoint subsets (partitions) in the array, which is based on the spotting pattern matrix generated by a single pin with the size 14 × 14. Denote Ll(j) as the l-th subset (location) on the array j, l = 1,…, s. Chen et al. (16) proposed the subset normalization model
yijk,l = m +Gi + Ll(j) + Ij + Dk + (AD)jk + eijk,l,
where m is the overall average signal, Gi represents the effect of the i-th gene, Ll(j) represents the effect of the location l on the j-th array, Ij represents the effect of the intensity on the array j, Dk represents the effect of the k-th dye and (AD)jk accounts for the effect of array j and dye k. This model is a generalization of the Kerr’s global ANOVA model (17,18); the array effects Aj are decomposed into location and intensity components, Ll(j) + Ij. In this paper, the Ll(j) is estimated by the median, Ij is estimated using the lowess fit and the other effects are estimated using the least-squares estimates. The residuals (normalized intensities) removing the overall effects from the fitted model correspond to the treatment × gene interactions as the effect of interest.
Figure Figure1A1A and B are scatterplots of the mean versus standard deviation of the un-normalized intensities for the control and treatment among the 705 genes, and the fitted lowess regression curves for the control and treatment samples. Figure Figure1A1A shows an approximately linear relationship between the mean and standard deviation before the log-transformation. Applying the log-transformation to stabilize variation, Figure Figure1B1B shows no apparent relationship between the mean and standard deviation. Figure Figure2A2A is a scatterplot of the mean versus standard deviation of the normalized intensities for the two groups. There is no apparent relationship between the mean and standard deviation. Figure Figure2B2B is a scatterplot of the standard deviations between the control versus the treatment. The standard deviations between the two groups for the 705 genes mostly appear to be similar. We also evaluated the correlation between two intensities on the same spot. The mean correlations for the un-normalized data and normalized data are 0.6 and 0.8, respectively; a significant correlation between two samples from the same spot. A normal probability plot is used to assess the normality assumption for this data set. A normal probability plot displays the ordered values of the data set versus the corresponding quantiles of a standard normal distribution. A linear plot would imply that the data are reasonably normal. Figure Figure3A3A and B are the normal probability plots for the control and treatment, respectively. It can be seen that the normalized intensity data appear to be heavy-tailed; this suggests that the data are more similar to a t-distribution than to a normal distribution. Using a permutation test to identify differentially expressed genes should be more appropriate.
The control and treated groups are compared one gene at a time using the one- and two-sample t-tests, and one- and two-sample permutation t-tests. Figure Figure44 displays the p-values of the 700 genes with excluding five housekeeping genes for the four tests. If there is no treatment effect for all genes, i.e. all null hypotheses are true, then the p-values should be uniformly distributed on the interval (0,1). That is, the p-value plot should be a straight line across the diagonal. If a null hypothesis is not true, then its p-value will tend to be small. In Figure Figure4,4, the horizontal line at the y-axis represents the number of significant genes at the correspondent level of significance α = 0.01 (vertical line). Figure Figure44 shows that the distribution of p-values appears to be quite non-uniform and the numbers of significant genes from one-sample tests and two-sample tests are substantially different. However, the behaviors of one-sample t-test and permutation test are very similar, as are the two-sample t-test and the two-sample permutation test. Moreover, the one-sample t-tests appear more powerful than the two-sample t-tests because of a positive correlation between the control and treated samples on the same array.
We conducted a Monte Carlo simulation experiment to evaluate the type I error of four methods for a control and treatment comparison. We generated gene expression levels under Models I–IV with r arrays per group, where r = 3, 5 and 8, and the number genes of in an array is g = 500 and 1000. We assumed an equal variance for the control and treated groups, 2ic = 2it = 2 and σ2ic = σ2it = σ2. The true expression levels µik for each channel at each spot were randomly drawn from a log-normal (base 2) distribution with mean 10 and the standard deviation 1.2 suggested by Hoyle et al. (19). This is based on 16-bit tiff images with the intensities ranging from 0 to 216–1 (from 0 to 65535). The parameter values of bivariate normal distribution (ηijc,ηijt) were 2 = 0.1 and 0.3 with the correlation τi = 0.9 and 0 (one-channel experiment). The parameter values of bivariate normal distribution (εijc,εijt) were σ2 = 0.5 and 1, and ρi = 0.1 and 0 with the constraint τi ≥ ρi. For each simulated data set, the proportion of significances was calculated for the four methods at significance level α = 0.01. One thousand random samples were generated in each analysis. Note that the one-sample permutation test was based on 10000 random samples from the population of all permutations. All simulations were carried out using Fortran 90 programs on Unix systems.
Table Table11 is the average of the proportions of significances for g = 500 and r = 5. Since data from Model III and Model IV give similar results, only the results from Model IV are shown. It can be seen that for τ > 0 or ρ > 0 (correlated model), both the two-sample parametric and two-sample permutation t-tests are conservative. In particular, the two-sample permutation test is very conservative because of small sample sizes. (The averaged proportions of rejections for all models are zero for r = 3, not shown.) Both one-sample parametric and one-sample permutation t-tests perform well, with few exceptions. Table Table22 is the averaged proportions of significance for g = 500 with r = 8. The tests show an overall improvement, as compared with r = 5; the averaged rejections are close to 0.01 for the two-sample permutation test with τ = 0 and ρ = 0 (independent model). The results for g = 1000 and r = 8 are similar (not shown). In summary, the one-sample parametric and one-sample permutation tests are similar, and two-sample parametric and two-sample permutation tests are similar with r = 8. When the data are correlated, both the parametric and permutation two-sample tests are too conservative because the assumption of independence is violated.
We conducted another simulation by varying the distribution of errors. The multiplicative errors (ηijc,ηijt) were drawn from a bivariate t-distribution with mean 0, degree of freedom 3 and correlation τ. For Models I and II, the additive errors (εijc,εijt) were drawn from a bivariate normal distribution with mean 0, variance σ2c = σ2t = 0.3 and correlation ρ. For Models III and IV, the additive errors were from the same bivariate t-distribution as (ηijc,ηijt) with correlation ρ. Table Table33 is the average of the proportion of significance for r = 3, 5 and 8. It can be seen that the parametric tests, which rely on the normality assumption, are too conservative. The one-sample permutation test appears to perform well for r = 5 and 8; the test becomes conservative for r = 3 because of small sample size.
In addition to the type I error, we also examined the powers of the four tests for g = 500, r = 5 and µc = 9 shown in Figures Figures55–8. Figure Figure55 shows that the one-sample parametric and permutation tests are more powerful when the samples are correlated, as expected. Even when the data were generated from a t-distribution, the one-sample permutation test is more powerful than other tests (Fig. (Fig.7).7). When the two groups are independent, the two-sample parametric and permutation tests appear to be more powerful than one-sample tests for all three models (Figs (Figs66 and and8).8). The permutation tests are as powerful as the parametric tests when r ≥ 5. In summary, when the data were generated from normal distributions with five replicates, the one-sample tests can detect a 2-fold change (in log scale) with >90% power and the two-sample test, under independence, can detect a 2-fold change with >95% power. When the data were generated from t-distributions, the powers are only 80% and 20% for one-sample and two-sample tests, respectively.
Intensity data from microarray experiments often involve a variety of random and systematic errors. In order to remove sources of variation, different transformation and normalization methods based on either raw or log-transformed intensity data have been proposed to adjust for stochastic biases. We use Models I and II to model raw expression data. The first order approximation of Model I is
Xijc = µic(1 + ηijc + η2ijc/2 + …) + εijc = µic + µic · ηijc + εijc,
Xijt = µit(1 + ηijt + η2ijt/2 + …) + εijt = µit + µit · ηijt + εijt.
This model has been proposed by Ideker et al. (14). They used the likelihood ratio test approach to identifying differentially expressed genes. The computation of likelihood ratio test is not straightforward; it requires estimating the parameters of the bivariate normal models. In present evaluation, Tables Tables11–3 show that the log-transformed data from Models I or II can be analyzed using the traditional or permutation t-test. The tests seem to perform reasonably well in terms of type I error and power under proper conditions, for example, the two-sample permutation test performs well under an independent model. Models III and IV assume that the log-transformed normalized intensity data are normally distributed. Therefore, the t-test or permutation test can be applied directly. Models I and III assume the two sources of variation for spot intensities are independent. Models II and IV assume a systematic variation due to array-specific effects, such as amount of RNA or different hybridization dates, etc. Model IV may be more appropriate for some normalization methods, such as median or lowess, that are array dependent.
Because of lack of replications, the early approach for assessing differentially expressed genes is based on the ratio of the treatment-to-control to determine significant genes. This concept leads to the use of the one-sample t-test for the analysis of data from two-color dye-swap experiments. Alternatively, the two-sample t-test has also been used to detect genes with differential expression (9). This paper demonstrates that the two-sample t-test (either parametric or permutation test) is conservative when the samples are correlated (Example data set). For a two-color dye swap experiment, the one-sample tests appear to perform better than the two-sample tests. On the other hand, when the expression data are independent observations, such as one-channel microarray or two-channel reference design, the two-sample t-test is more powerful.
When the number of arrays is sufficient, the permutation test performs better than the corresponding parametric test when the data do not follow a normal distribution. In practice, the distribution of the normalized intensities appears to have a t-distribution rather than a normal distribution. With a small sample size, however, permutation tests can produce a skewed or bimodal reference distribution. For example, at replicate arrays r = 3, only 20 permutations are possible. When the number of replicates is small (r ≤ 3), the permutation test is not recommended.
The power study assumes a constant effect (mean difference) for all genes and evaluates the average proportion of significances of the given effect. In practice, the majority of genes do not express differentially between treatment groups. Furthermore, the genes that would be affected by a treatment generally have different effects. Different effects will result in different powers. However, the conclusions summarized above should remain valid.
The authors wish to thank Drs Frank Sistare and Karol Thompson for providing the example data set.