In most clinical microarray data sets, very little is known about the true expression levels, and without such reference points it is difficult to evaluate the effectiveness of data transformations or statistical methods. However, one possible criterion is that multiple samples taken from the same tumor should be more similar to each other than they are to samples taken from other tumors. In a breast cancer data set previously published by Signoretti et al
], 98 surgical specimens were profiled on Affymetrix HG-U95Av2 arrays. Of these, eighteen samples form nine replicate pairs, in which two samples were taken from adjacent sections of the same frozen block [5
]. After normalizing these data with the robust multi-array average (RMA) algorithm [6
], we performed hierarchical clustering and found that only four of the nine replicate pairs clustered together (Figure ).
Figure 1 Bias correction improves concordance of replicates in a large breast cancer microarray study. Hierarchical clustering was performed, and the number of replicate samples correctly paired (joined at the lowest node) was used as a measure of concordance. (more ...)
In an effort to increase the number of replicate pairs clustering together, we tried various thresholds for filtering out low-variance genes. This common practice is intended to reduce noise by removing genes that are essentially unchanged between samples [7
]. As expected, we found that using increasingly stringent thresholds yielded more replicate pairs clustered together (Figure ). When we removed all but the most variable 100 genes, each of the 9 replicate pairs clustered together (Figure ). Satisfyingly, seven normal (non-tumor) samples also clustered together, which had not occurred with the unfiltered data. Thus, at least in this data set, strong filtering of expression data results in dendrograms that more accurately reflect known relationships between samples. Such filtering might be appropriate if the goal is to extract robust clinical markers. However, for other types of analysis (for example, when seeking information about a predetermined set of genes with known biological function) such extreme filtering may eliminate valuable information, in which case correction of noisy data would be preferable to its removal.
Although it is remotely possible that the dendrogram resulting from the original, unfiltered data (Figure ) indicates biologically relevant relationships, we assumed that the disruption of replicate pairs is an artifact caused by noise in the filtered-out genes. If this noise were entirely random, filtering out genes would be the only way to reduce noise. However, if the noise were systematically biased, it might be possible to characterize the bias and correct the measured expression values. Notably, even a small amount of systematic bias could have substantial influence on clustering, if the bias affects a large enough subset of genes in a coordinated fashion.
Therefore, we searched for evidence of systematic bias in the Signoretti data set. We anticipated that any systematic bias would affect a subset of genes having some common characteristic, rather than a random subset of genes. Such bias would then tend to increase the apparent correlation between pairs of genes within this subset. By visualizing the correlation between pairs of probe sets, as a function of intensity, we observed intensity-dependent correlation bias (Figure ). Remarkably, this intensity-dependent correlation bias was present in all data sets we checked [8
] and was not improved by replacing RMA normalization with Microarray Suite 5 (MAS5), model-based expression index (MBEI) [13
], or GeneChip robust multi-array average (GCRMA) normalization [14
] (Additional data file 1). Importantly, the bias was essentially undetectable after post-summary quantile normalization (PSQN), a secondary normalization step that gives each sample the same distribution of expression values, demonstrating that the bias is related to the distribution of intensities and does not reflect biologically relevant gene relationships (Additional data file 1). Also, PSQN applied to the unfiltered Signoretti data set increased the number of paired replicates from four (out of nine) to five (data not shown).
Figure 2 Spurious intensity-dependent correlations between pairs of genes and between genes and bias metrics. Probe sets were sorted by mean expression and divided into 50 bins of approximately 250 probe sets each. For each possible pair of bins, the median correlation (more ...)
Because PSQN only partially resolved the misclustering of replicates in the Signoretti data, we sought to understand the underlying causes of the bias. Two sources of variation, those arising during hybridization and those intrinsic to the starting RNA, seem especially likely to influence many probe sets in a coordinated manner. During hybridization, a subset of probes is susceptible to nonlinear effects caused by saturation at high intensities or by the noise floor at low intensities [15
]. For these probes, changes in target concentration have a diminished effect on probe intensity compared to probes in the linear regime. Thus, if samples are loaded onto arrays at slightly different concentrations, or are hybridized or washed under slightly different conditions, any subsequent normalization is likely to overcorrect probes in the nonlinear range and undercorrect those in the linear range. The use of multiple probes and robust normalization schemes may reduce the effect of these nonlinear probes, but our observation of intensity-dependent bias (Figure ) indicates that many normalization schemes that do not account for systematic bias may be inadequate in some cases.
A second potential source of bias is the variable quality and quantity of starting RNA, which may be especially problematic in data sets featuring samples from surgical specimens [16
]. For an individual transcript to contribute to the measured signal, it must be intact between the region targeted by microarray probes and the poly-A tail, and the polymerases involved in Eberwine-type amplification/labeling steps must complete their processes over this entire distance. Therefore, any variation in mRNA integrity or in polymerase processivity should affect the measured intensity by a factor proportional to the number of bases between the probe target and the 3' poly-A tail. Also, variable amounts of starting mRNA could affect the final diversity of the amplified cRNA, such that a sample with less starting mRNA would tend to have fewer reliably measured genes and, therefore, a higher signal-to-noise ratio.
In our current analysis we hypothesized that a set of four metrics could characterize the relative amount of bias affecting each sample in a data set. To capture the saturation and noise floor of the raw probe intensities, we used the median and interquartile range (IQR) of the perfect-match probes on the array ('PM median' and 'PM IQR'). The effect of RNA degradation was estimated by the average decrease in expression between 5' probes and 3' probes ('degradation'). Finally, the diversity of starting mRNA was characterized by the IQR of the RMA-summarized expression values ('RMA IQR').
In the Signoretti data set, each of the four bias metrics is correlated with the expression values of more probe sets than expected by chance (Figure ). When applied to the gene expression vectors in multivariate linear regression, the four bias metrics explained 33% of the variance, whereas a set of random vectors would be expected to explain 4%. The correlation between probe sets and the bias metrics was strongly intensity-dependent and, thus, may be the source of the intensity-dependent correlation bias (Figure ). The RMA IQR and degradation bias metrics together explain 44% of the variance in the sample distance matrix, indicating that the dendrogram derived from all genes in Figure was largely driven by technical bias.
Figure 3 The expression values of many probe sets are correlated with each of four bias metrics. For each bias metric, a density curve indicates the distribution of Pearson correlation coefficients between the bias metric and the expression values of each probe (more ...)
Figure 4 The correlation between genes and bias metrics is non-randomly distributed according to mean gene intensity. For each probe set in the breast cancer data set, we calculated the mean intensity (across all 98 samples) and the correlation with a bias metric (more ...)
Since the bias metrics did not correspond to any available clinical covariates (hormone receptor status, HER-2 amplification, tumor grade), we removed the component of the expression matrix explained by the bias vectors (see Materials and methods). The resulting bias-corrected data had substantially less intensity-dependent correlation bias (Figure ), and when we subjected it to hierarchical clustering we found that all nine replicates clustered together (Figure ). In a separate glioma data set with two pairs of replicates [17
], an identical procedure also increased the number of pairs clustering together (Figure ). Thus, this bias correction method seems to improve the correspondence between hierarchical clusters and known relationships between samples without filtering out the majority of the genes.
Figure 5 Bias correction improves concordance of replicates in a glioma gene expression study. Hierarchical clustering was performed as in Figure 1. (a) The unfiltered, uncorrected data results in neither of the two replicate pairs together. (b) Bias correction (more ...)
To investigate the effects of bias correction on single-gene analysis, we used the estrogen receptor (ER) status of breast tumors as a reference point [7
]. Since the ER-positive and ER-negative subtypes are observed in many data sets and on several microarray platforms, the correlation between ER status and the expression level of any given gene should be relatively well conserved between data sets of sufficient size. The correlation of correlations (CC) is a measure of consistency in gene-phenotype association between one data set and another [18
]. We calculated the pairwise CCs between five breast cancer data sets with a large number of patients (n
range 99-289), a common platform (Affymetrix HG-U133A), and ER status annotation [8
]. Bias removal increased the average CC from 0.70 to 0.76, although not all pairs of data sets were improved (Figure ). Notably, it was the data set pairs with an initially low CC that benefited most from bias correction. In the most extreme case the CC between two data sets increased from 0.55 to 0.76 after bias correction (Figure ).
Figure 6 Bias correction increases consistency of correlation with ER status within five clinical breast cancer studies. (a) CC coefficients are plotted for each possible pair of data sets, before and after bias correction. The largest improvement, indicated by (more ...)
As an alternative to our bias correction method, we considered batch adjustment, in which the expression matrix is adjusted, one probe set at a time, to remove any difference in means between the batches. From the scan dates embedded in the CEL files, we inferred that the Signoretti breast cancer data were scanned in three batches. We applied batch adjustment to this data set and found that the number of correctly paired replicates increased from four to five (data not shown). Thus, on this data set, batch adjustment is less effective than our bias correction method. The other data sets used in this study were each scanned in a relatively large number of batches (median, 14; range, 8-31), so batch adjustment was not attempted.