Comparisons among arrays are meaningful only after accounting for variability in overall target concentration ("target" is the Affymetrix nomenclature for a labeled cRNA that hybridizes with a probe), hybridization efficiency, staining, etc. The normalization procedure recommended by Affymetrix is to multiply raw signals by a scaling factor such that the trimmed mean (excluding 2% highest and 2% lowest) of signals is always the same (500 in this study). This procedure could be problematic if a variable proportion (>2%) of the signals are beyond the linear range of the system. Another concern about the normalization procedure was that the majority of the targets did not produce signals that were significantly greater than those caused by nonspecific hybridization. After the recommended normalization procedure was applied, we confirmed that there was negligible inter-array variability of the mean signal (with 5% of signals trimmed from both the high and low ends) across the 4629 targets that passed the presence / absence filter (described in the next section). The trimmed mean was 649 ± 14 (standard deviation) arbitrary units for the 6 arrays probing RNA from young muscle, and 643 ± 18 for the 6 arrays probing RNA from older muscle. These data were not used to re-scale the arrays because the variance would have been reduced by less than 2%. Plots of signals from individual arrays versus the average of all 12 arrays generally showed the expected scatter around the line of identity (Figure ), but a few showed systematic deviations either above or below the line of identity for signals > ~104
arbitrary units (worst-case array shown in Figure ). While this problem might be addressed by using a different scaling factor for high-intensity signals [10
], few targets produced such high signals, and the magnitude of the effect was small. Thus, the Affymetrix normalization method was employed without modification.
Figure 1 Individual arrays vs. mean of all arrays A. Scatter plot of signals from 4629 probe sets on a typical array vs. mean signals from all 12 arrays. Line of identity is shown. B. Worst-case scatter plot. Same as plot A, except vertical axis represents a different (more ...)
Exclusion of targets based on the absolute detection algorithm
Microarray Suite estimates probabilities that targets are absent (Pdetection) based on ratios of signals from PM probes to those of MM probes, together with the degree of consistency across the multiple probe pairs for each target. As illustrated in Figure , the average signal from the multiple probe pairs cannot be used to decide whether a target should be considered present or absent. We restricted the data analyses to targets for which Pdetection was less than 0.1 for at least 3 of the 6 samples from either the younger or older group. This filter reduced the number of targets included in the statistical analyses to 4629. While excluding data does not affect the nominal value of P for each comparison made with a t-test or rank sum test, it significantly reduces the estimated false detection rate (see t-Tests and SAM below).
Figure 2 Pdetection vs. signal Signal, in arbitrary units, is the average PM-MM intensity difference across all 8–20 probe pairs within a set. Pdetection is the probability that a target is absent, based on the consistency of PM/MM ratios within a probe (more ...)
Signal method vs. ratio method
When two arrays are compared, the gene expression ratios obtained by the signal method and those obtained by the ratio method (see Background for explanation of terms) are highly correlated. However, the results often differ by more than 1.5-fold (Figure ). The advantage of the signal method is that Microarray Suite provides, for each target, a number describing the level of gene expression (in arbitrary units) that can be used for t-tests or other statistical procedures. However, according to Affymetrix (Microarray Suite 5.0 User's Guide), comparisons between arrays are more precise when the ratio method is used, so the values on the horizontal axis of Figure should be more accurate. The Affymetrix ratio method only provides ratios between two arrays, and does not provide gene expression values for the individual arrays that can be used with standard tests of statistical significance. We therefore extended the ratio method to generate a relative expression score for each target on each array, as follows:
Figure 3 Comparison of two arrays by different methods Horizontal axis shows the ratios between two arrays, for 4629 targets, according to the comparative analysis algorithm, which is the basis of the ratio method. Vertical axis shows ratios between the same arrays (more ...)
The first step was to name one of the arrays as the baseline in the comparative analysis program (Microarray Suite 5.0). Every other array included in the study was then compared with that baseline array. This procedure yielded, for each target, a set of 11 expression ratios (r) representing the relative expression level on each array compared with the baseline array.
The next step was to compute, for each target, the expression level (R) of the baseline array relative to all 12 arrays included in the study. For array #1, R was computed with the formula:
R1 = 12/(1 + r2 vs. 1 + r3 vs. 1 + ... + r12 vs. 1)
The value of 1 in the denominator of this formula represents the comparison of array #1 with itself. The number of arrays is the numerator rather than the denominator in this formula because the Affymetrix comparative analysis program sets the baseline array as the denominator, so that values of r are the inverse of the relevant ratios.
A different array was then named as the baseline. E.g., for array #2 as the baseline:
R2 = 12/(r1 vs. 2 + 1 + r3 vs. 2 + ... + r12 vs. 2)
These steps were repeated until all 12 arrays had been named as the baseline. The values R1 through R12 were then used for comparisons between age groups with t-tests, rank sum tests, and SAM as described below.
For the 4629 probe sets that passed the presence / absence filter, the expression ratios (mean value in old group / mean value in young group) generated by the signal method and those generated by the ratio method were highly correlated (r = 0.89). There also was a fairly good correlation between the signal and ratio methods with respect to the level of statistical significance (log P) of the age-related differences (r = 0.74). The advantage of the ratio method was that it usually reduced the within-group variance so that the same mean difference between young and old was associated with a higher level of statistical significance. The average within-group coefficient of variation (CV, standard deviation / mean) was 20% with the ratio method and 25% with the signal method (average CVs were the same for young and old groups). The distribution of CVs improved significantly with the ratio method (Figure ). Table shows that more differences were detected by the ratio method whether we used t-tests, rank sum tests, or SAM to define significant differences. Moreover, consistency between the initial scan and the antibody-enhanced scan was significantly improved by the ratio method, for both expression ratios and for the statistical significance of differences between young and old (Table ). With the signal method, 38% of the differences significant at P < 0.01 (by t-test) on the initial scan were also significant at P < 0.01 on the antibody-enhanced scan, and 65% were significant at P < 0.05 on the antibody enhanced scan. With the ratio method, 51% of the differences significant at P < 0.01 on the initial scan were also significant at P < 0.01 on the antibody-enhanced scan, and 77% were significant at P < 0.05 on the antibody enhanced scan.
Frequency distribution of coefficients of variation (CVs) Distribution of 4629 CVs obtained by the ratio method (solid bars) and the signal method (open bars). CVs are average of within-group CVs in young and old groups.
Number of differences detected: comparison of signal and ratio methods
Correlation coefficients of results of initial scan and antibody-enhanced scan for 4629 probe sets with respect to expression ratios (mean old / mean young) and statistical significance (from t-tests) of the differences between young and old
A plot of expression ratio vs. statistical significance (Figure ) shows that differences with high statistical significance (by t-test) usually were less than 1.7-fold in magnitude. The validity of at least some of the small differences is demonstrated by the fact that ~1.3-fold differences were detected (P < 0.01) for 17 genes encoding proteins involved in mitochondrial electron transport or ATP synthesis (Table ). This finding replicates our previous research, in which SAGE and quantitative RT-PCR assays demonstrated ~1.3-fold declines in older muscle of several mRNAs encoding components of NADH dehydrogenase, cytochrome c oxidase, and ATP synthase complexes [11
]. For all of these mRNAs, both the signal and ratio methods detected the differences at P < 0.03, whereas the ratio method was a bit more likely to detect them at P < 0.01 (14/17 genes) than was the signal method (12/17 genes, Table ).
Figure 5 Volcano plot Statistical significance by t-tests [-log(P)] vs. expression ratio (mean old / mean young) for 4629 targets that passed the presence / absence filter. Note log2 scale on horizontal axis. Vertical lines represent 2-fold difference between (more ...)
Reduced expression in older muscle of genes involved in energy metabolism (ER = expression ratio = mean value in old / mean value in young)
The P values generated by the t-tests were not adjusted for multiple comparisons. However, a Bonferroni correction would be too stringent for exploratory research [12
]. A useful alternative to P in studies involving thousands of comparisons is the estimated false detection rate, which is the ratio of the expected number of chance differences (P × number of comparisons) to the number of differences observed. If we use P < 0.01 to define a significant difference, we should expect ~46 chance differences (0.01 × 4629 comparisons) if there is no effect of aging on gene expression. Because 124 differences were significant at P < 0.01 (by the ratio method), the estimated false detection rate was 46/124, or 37%. When no presence / absence filter was applied (12533 probe sets included in the analysis), the estimated false detection rate (ratio method) increased from 37% to 73% because there were fewer differences (at P < 0.01) among the "absent" mRNAs than expected by chance (48 observed vs. 79 expected by chance).
Mann-Whitney rank sum tests
The Mann-Whitney rank sum test was used to detect transcripts for which there was little or no overlap of values between groups. This test revealed 107 differences at P < 0.01 (exact P = 0.00866 at rank sum cutoff values) when the ratio method was used. We would expect to find 40 differences by chance alone (0.00866 × 4629 genes), so the false detection rate (40/107 = 37%) is the same as that estimated from t-tests. There were 21 differences significant at P < 0.01 by rank sum tests but not by t-tests according to the ratio method. Thus, for exploratory research being done to generate lists of mRNAs that warrant further study, use of both parametric and nonparametric tests is one way to significantly expand the list.
SAM computes a value, termed d [d = (mean1 - mean2)/(sd + s0)], that is similar to t [t = (mean1 - mean2)/sd]. The "exchangeability factor", s0, minimizes the number of extreme d values among targets with small variances in signal intensity. When absolute signals are analyzed, these small variances are associated with targets that are more difficult to quantify accurately because of low concentrations. We already have filtered most of these targets from the analysis. When data based on the signal method were analyzed, s0 was very small (lowest percentile of the sd values). When relative expression levels were computed by the ratio method, all means were close to 1 regardless of the absolute signal intensity. In this case, s0 was large (53rd percentile of the sd values), and lower sd values were associated with stronger rather than weaker signals. This problem precluded the use of SAM for data normalized in this fashion. However, by multiplying each value of R by the gene-specific mean signal (mean of all 12 arrays), we generated a data set compatible with SAM.
SAM lists genes for which d exceeds (by an adjustable threshold termed Δ) the value that would be expected by chance (de). Values of de are generated by computing the d distribution numerous times with random permutations of the group assignments (we instructed SAM to perform 100 permutations). The average distribution from these permutations defines the values of de. Reducing Δ expands the list of "significant" genes, but also increases the false detection rate. When we chose a value of Δ corresponding to a false detection rate < 20%, there were 124 differences according to the ratio method but only 56 differences according to the signal method. There were 20 differences for which the false detection rate was < 5% when the ratio method was used (including 9 for genes involved in energy metabolism that are listed in Table ), but none when the signal method was used. When no presence / absence filter was applied, the highest-ranked differences had false detection rates of 30% even with the ratio method, and only 34 genes achieved this level. Thus, it is very important to remove noisy data before using SAM.