oligonucleotide arrays are a popular platform for the high-throughput analysis of gene expression in mRNA. Nguyen et al [1
] give an introduction to the technology for quantitative scientists. Briefly, an oligonucleotide array contains 11–20 probe pairs for each gene. Probe pairs consist of an oligonucleotide that is a "perfect match" (PM) to a subsequence of the mRNA transcript for a gene and a corresponding "mismatch" (MM) oligo that differs from it in one base in the middle. These MM probes are meant to provide information on cross-hybridization.
Quantitative researchers have proposed a variety of methods for handling probe-level data from Affymetrix®
oligonucleotide arrays. Methods employ different procedures for adjusting for background fluorescence, normalizing the data, incorporating the information from "mismatch" probes, and summarizing probesets (combining all the data from the different probes for a given gene). In particular, the value and proper use of data from MM probes have been subjects of some controversy [2
]. It is important to validate a method for its effectiveness in achieving scientific goals, such as estimating relative gene expression or detecting differentially expressed genes [3
]. Note that different methods may be preferable for different scientific goals [4
Previously, spike-in studies have been used to study the variance and bias of different estimates of relative expression derived from oligo array data. These studies are useful and important, but are not the end of the story. First, spike-in datasets are inherently artificial, and may not realistically represent the operating characteristics of a methodology on real data [5
]. For example, the Affymetrix Latin Square Dataset studied by Bolstad et al [6
] has only 42 genes changing from sample to sample. In addition, this dataset was used to develop several methods, so it is not appropriate to use for validation. Finally, a criterion often not considered in the spike-in studies is the accuracy of measurements across genes
. Instead, Bolstad et al [6
] largely considered measurements across RNA samples for single genes. Obviously, these problems are related, yet they are not identical.
Choe et al [7
] conducted a study using an experiment where 100–200 RNAs were spiked-in at various fold-changes. All RNAs other than the spike-ins had the same level in all samples. Impressively, the authors considered over 100 different combinations of methods for background adjustment, normalization, use of MM probes, and probeset summary methods. Many of the study's conclusions are based on the shared features of the ten best-performing combinations. However, eight of those ten combinations used a normalization based on the known subset of genes that were constant between the RNAs that were compared. Such a normalization scheme could not be implemented in an actual experiment where the identity of unchanging genes is unknown. This casts some doubt on the generalizability of the study's findings. Further concerns about generalizability arise from the study's non-standard RNA production protocol. In addition, one of the study's RNA samples contained unlabeled poly(C) RNA, to unknown effect.
Among evaluations that do not rely on spike-in datasets, Ploner et al [8
] favored methods that produced zero correlation, on average, between randomly selected pairs of genes. Though creative, this criterion unfortunately does not correspond to a scientific question of interest. Furthermore, the criterion might favor methods that "over-normalized" the data – removed signal as well as systematic biases. Shedden et al [5
] identified methods that optimized sensitivity for detecting differentially expressed genes. The authors relied on estimates of false discovery rates rather than using data from an independent validation technique for comparison.
In contrast to the studies that use spike-in datasets, our study is based on a real dataset that was collected to answer biological questions. The studies by Choe et al [7
] and Shedden et al [5
] are directed at identifying the best methods for selecting differentially expressed genes, which are not necessarily the best methods for estimating relative expression. In contrast, we focus here on the problem of estimating relative expression. We do not mean to suggest that previous approaches lack merit. Rather, different approaches have advantages and disadvantages, and a plurality of studies is needed.
In our experiment, heart tissue was collected from 24 individual mice in a 2 × 2 design (see Table ). Affymetrix GeneChips® (Murine Genome Array U74Av2) were used to assay RNAs from these tissue samples. Quantitative RT-PCR measurements for 47 genes were taken on these same 24 RNAs. As the "gold-standard" method of measuring gene expression, we treat the qRT-PCR measurements as "truth" for the purposes of this study. In our "overview" investigation, array data were processed in six different ways to arrive at estimates of gene expression among the 24 mice. The six methodologies are MAS5, gcRMA, RMA, VSN, and two versions of dChip (see Table for information on methods and references). We used Pearson Correlation to measure the agreement between array and qRT-PCR measurements on six group comparisons, or "contrasts," among the mice (see the 'Contrasts' section of METHODS and Figure ). In a follow-up investigation, we considered 56 different combinations of the components of these methods.
Table 1 Biological Samples. RNA samples were from an unbalanced 2 × 2 factorial design. The 24 mice were young or old, wild-type or carried the MCAT transgene, which directs overexpression of human catalase to the mitochondrial cellular compartment. Transgene (more ...)
Table 2 Methods under Evaluation. Summary of the six methodologies for oligonucleotide array data that were compared in this study. Details on the methodologies can be found in the references. An asterisk (*) marks components of methods that were studied in the (more ...)
Figure 1 Agreement between array and qRT-PCR for the comparison of Y-WT and O-WT mice. For the Y-WT vs. O-WT contrast, the figure shows estimates of relative expression from the array data, processed with six different methodologies, compared to qRT-PCR. Estimated (more ...)
Our choice to use Pearson's correlation, r, is motivated by the following formula. While not the standard textbook definition of r, a more instructive approximate formula is
where β is the slope of the line for predicting Y from X, Var(X) is the variance of X, and Var(Y|X = x) is the variance of Y in groups that have the same value of X. In our application, X is the measurements from qRT-PCR and Y is the measurements from array. X is fixed, and so also Var(X), but Y depends on what method is applied to the array data. Since Var(Y|X = x) appears in the denominator, a method's performance is improved if it minimizes Var(Y|X = x). Therefore, this metric tends to favor methods with smaller variability. Similarly, the larger the slope between Y and X, the larger r is and the more favorable a method's performance. In this sense, by using Pearson's correlation, we simultaneously take into account both the variance and bias of the measurements produced by arrays. That is, we seek methods that achieve the right balance between variance and bias to yield the strongest association between array measurements and qRT-PCR. However, it is also of interest to specifically examine variance and bias, and we will come back to this.
We note that the 47 genes assayed with qRT-PCR were selected based on primer availability, initial evidence for differential expression, signal intensity, and biological interest. The 47 genes do not comprise a random sample. In particular, the genes for which we have qRT-PCR data do not include low-intensity genes (see Figure and AdditionalFigures.doc for the additional contrasts). The 47 genes are medium- and high-intensity genes, with the larger fold-changes tending to be for the medium-intensity genes. Therefore, our results about inter-platform agreement pertain primarily to high- and especially medium-intensity genes. We will return to this important issue.
Figure 2 Genes selected for qRT-PCR are medium to high intensity in array data. The plot highlights the genes selected for qRT-PCR in a scatterplot of the YWT vs. OWT contrast against the mean signal intensity. Data were processed with gcRMA for this plot. Selected (more ...)