Improvements in the quality of gene expression data were investigated based on a database consisting of 5168 oligonucleotide microarrays collected over 3 years. The database includes diverse treatments of human and mouse samples collected from multiple laboratories. The array designs and algorithms used to capture the data have also changed over the 3 years of data collection. All hybridizations and labeling were conducted in the Hartwell Center for Bioinformatics and Biotechnology at St. Jude Children’s Research Hospital. Quality metrics for each human and mouse array were collected and analyzed. Statistical tests, such as ANOVA and linear regression, were applied to test for the effects of array design, algorithm, and time. The quality metrics tested were average background, actin 3′/5′ ratio, Bio B signal, percent present, and scale factor. ANOVA results indicate that both recent algorithms and chip designs significantly correlate with improvements in Bio B, scale factor, percent present, and average background. Significant quality improvements correlated with new chip designs, algorithms, and their interaction. In addition, significant improving trends were still observed within a single chip type analyzed by the same algorithm. Scale factor, percent present, and average background significantly improved over time for U133A arrays analyzed by the Affymetrix MicroArray Suite 5.0 algorithm according to linear regression. Proportionally fewer outlier arrays (those with less than 25% present calls) were seen over time. Also, high throughput periods did not increase the proportion of outliers, indicating that laboratory monitoring of quality is successfully preventing failures.
The importance of data quality is rarely emphasized, even though data quality directly affects the reliability of experimental results. The time and resources spent detecting, repairing, or repeating low-quality results can be substantial. Despite the undisputed value of quality control (QC), there are relatively few publications on the subject as it pertains to microarray analysis.
Some studies seek to correct data at the spot or sequence level.1–4 Other studies of quality examine the array synthesis process5 or bench protocols.6 Considerable attention has been focused on normalization or transformation methods that identify and attempt to correct perceived data quality problems. These methods are intended to correct systematic technical errors due to physical properties of the arrays.7–8 For example, a normalization method may correct spatial bias across an array9 or remove dye effects in two-color arrays.10
Here we use methods for assessing the quality of large data sets11 and databases12 to provide insight into the functioning of the arrays and the improvements over time. The value of this report is in the scale of the data set used at St. Jude Children’s Research Hospital and the 3-year time span for data collection. The wide range of test conditions and laboratories makes this data set a representative sampling of the RNA samples likely to be tested with human or mouse microarrays.
Using Perl scripts, QC metrics were collected from Affymetrix MicroArray Suite (MAS) version 4.0 and 5.0 report files (.RPT format). There was one QC report file for each Affymetrix GeneChip. Data from 5168 human and mouse arrays were retained. Arrays from other organisms were excluded (208 arrays). Next, the data were imported into STATA SE/8.2 (College Station, TX), where statistical analyses were performed. A subset of QC metrics was selected to represent distinct stages in microarray processing. These metrics were then subjected to statistical analysis using analysis of variance (ANOVA) and linear regression as appropriate. First, global trends were graphed and analyzed by ANOVA to detect the influence of array design and data extraction algorithm. Then, data from a single array design analyzed with a single algorithm were examined for time trends with linear regression.
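As a sketch of this harvesting step (shown here in Python rather than the Perl actually used), the snippet below pulls a few QC fields out of report text. The "Name: value" layout and the field names are assumptions for illustration only, not the actual .RPT file format.

```python
# Hypothetical sketch of harvesting QC metrics from a MAS report file.
# The "Name: value" layout and field names below are illustrative
# assumptions, not the real .RPT format.
import re

QC_FIELDS = ("Average Background", "Scale Factor", "Percent Present")

def parse_report(text):
    """Collect the whitelisted QC fields from report text into a dict."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"\s*([^:]+):\s*([-\d.]+)", line)
        if m and m.group(1).strip() in QC_FIELDS:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics

sample = """Average Background: 52.3
Scale Factor: 1.84
Percent Present: 43.1
Noise: 2.2"""

print(parse_report(sample))  # "Noise" is ignored; the three QC fields are kept
```

One dict per report file, accumulated over all arrays, would then be exported as a flat table for import into a statistics package such as STATA.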
MAS 4.0 and 5.0 report files collected from 11/29/2000 to 01/08/2004 represent many chip types. Table 1 gives a breakdown of the chip types included in this summary analysis. The data collected include three generations of array designs and three analytical algorithms.
Chip types with the “A” designation comprise the majority of the arrays. Further, only the U95 and U74 designs were analyzed with all three algorithms. Later designs were analyzed exclusively with the statistical MAS 5.0 algorithm. From Table 1 we note that A-type arrays are much more commonly used than other arrays.
In Figure 1 we observe that the data do not fall into expected ranges. For example, actin 3′/5′ ratios and Bio B spike-in values are expected to always be positive. Yet both data sets contain a small number of extreme negative values. These negative values are artifacts of the MAS 4.0 algorithm and are not observed in current data.
Affymetrix algorithms provide a report file that includes several measures of quality. These metrics are derived from analysis of each laser-scanned image file (.DAT) and therefore inherit properties of the scanner and the software used to extract data from the image. QC metrics are designed to detect problems at different stages in the RNA labeling and hybridization process. The metrics selected were average background, Bio B, actin 3′/5′ ratio, scale factor, and percent present calls. The average background is calculated from the 2% of probes with the weakest signal and is an estimate of general nonspecific binding based on low-intensity features across an array. Bio B is a probe set designed to measure prelabeled bacterial nucleotides. Bio B is the signal from internal prelabeled standards and measures the efficacy of hybridization, washing, and scanning; it is free of RNA, amplification, and labeling effects. The actin 3′/5′ ratio is a ratio of probe sets designed to detect the 3′ and 5′ regions of the actin mRNA transcript and is reputed to detect RNA degradation. This ratio is thought to indicate RNA quality as well as the bias inherent in the Affymetrix labeling assay. The scale factor is a global normalization constant based on the trimmed mean of probe set signals or average differences and is inversely related to chip brightness. Percent present is an array-level summary of the results of a statistical function designed to predict the presence or absence of each transcript. Percent present is sensitive to any error source from RNA sampling to scanning and data extraction, and is influenced by all stages in the microarray process including scanner brightness, background, RNA quality, algorithm, and chip design (see http://www.affymetrix.com/support/technical/technotes/statistical_reference_guide.pdf).
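The background and scale-factor definitions above can be sketched directly. The 2% cutoff for background and the default target signal of 500 come from the text; the exact trimming scheme and the function names are illustrative assumptions, not Affymetrix's implementation.

```python
# Illustrative sketches of two report-file metrics, following the
# definitions in the text; not Affymetrix's exact implementation.

def average_background(intensities):
    """Mean of the dimmest 2% of probe cell intensities."""
    vals = sorted(intensities)
    n = max(1, int(len(vals) * 0.02))
    return sum(vals[:n]) / n

def scale_factor(signals, target=500.0, trim=0.02):
    """Target signal divided by a trimmed mean of probe set signals;
    brighter chips therefore get smaller scale factors."""
    vals = sorted(signals)
    k = int(len(vals) * trim)
    core = vals[k:len(vals) - k] if k else vals
    return target / (sum(core) / len(core))

# A chip whose trimmed mean signal is 250 is scaled up by 2x,
# while a chip averaging 1000 is scaled down to 0.5x.
```

The inverse relation to brightness is visible in the formula: as the trimmed mean signal rises, the factor needed to reach the fixed target falls.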
Except for Bio B, which is based on internal standards, quality metrics are influenced by cumulative errors. Thus, errors in early stages of processing influence late stages. As a result, QC metrics that measure early stages may in theory correlate with downstream metrics. Some metrics are calculated by novel methods (percent present and scale factor) and are subject to parameter settings (e.g., target signal for scale factor). Further, percent present and scale factor are calculated based, in part, on background. Table 2 gives the correlation matrix for 5167 arrays (one array had no Bio B measurement).
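A correlation matrix of this kind reduces to pairwise Pearson correlations. The snippet below is a stdlib-only sketch on invented toy values, not the 5167-array data behind Table 2.

```python
# Pearson correlation sketch; the metric values below are toy data,
# not the measurements summarized in Table 2.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# As in the text, percent present should correlate negatively with
# average background (quality up, background down).
background = [40.0, 55.0, 70.0, 90.0]
percent_present = [48.0, 44.0, 39.0, 30.0]
r = pearson(background, percent_present)  # strongly negative on this toy data
```

Running every pair of QC metric columns through such a function yields the full matrix reported in Table 2.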
We observed that scale factor (the normalization constant) correlates best with percent present and Bio B (Table 2). All three of these measures are sensitive to the brightness of an array. Bio B is a prelabeled spike-in control, so its correlation with the other brightness measures is a function of either the hybridization process or the laser scanning. Note that percent present calls are negatively correlated with each QC metric. This is expected, since percent present increases as quality increases whereas the other measures increase as quality decreases. Background correlates well with percent present and Bio B but not with scale factor or actin 3′/5′ ratio. The actin 3′/5′ ratio does not correlate well with any other QC metric.
Figure 1 demonstrates that the observed QC metrics exceeded the theoretical ranges. Negative and extremely high values are unexpected for Bio B and actin 3′/5′ ratios. However, they are observed, although with low frequency. Negative signal values do occur with the MAS 4.0 algorithm. This fault does not occur in MAS 5.0.
Figures 2–6 are box-and-whisker plots,15 in which the box defines the interquartile range, the center line is the median, the dots are outliers, and the whiskers extend to the first adjacent observations (i.e., the most extreme non-outlier observations).
Actin 3′/5′ ratios do not improve over time (Fig. 2A) globally or for the HG_U95Av2 arrays alone (Fig. 2B). For U133A, quality appears flat, except for one quarter where the ratios are unusually high (Fig. 2C). Again, for the mouse U74Av2 no clear trend is evident (Fig. 2D). ANOVA results indicate that the chip type and algorithm do not explain more than 1% of the variance [R2 = 0.01, F(20, 5166) = 2.77]. The interaction between chip and algorithm is not a statistically significant predictor of actin 3′/5′ ratios.
The scale factor does stabilize over time (Fig. 3A). The trend is a lower frequency of values in excess of 100 over time (Fig. 3B). For all arrays, scale factor produced extreme values (in excess of 1000) until the statistical algorithm was employed. These arrays use a set target value of 500 as a default. The ANOVA results explain 13.4% of the variability in scale factor due to chip type, algorithm, and their interaction [R2 = 0.134, F(38, 5167) = 20.95]. If only arrays with a target value of 500 are tested, then chip type alone explains 16.2% of the variance [R2 = 0.162, F(14, 3285) = 45.07]. Examining those data on a log scale (Fig. 3C) shows that the range of scale factor is decreasing but the median does not have a clear trend. If only U133A arrays set to target 500 are examined on a weekly basis, there is a significant negative linear trend in scale factor over time on a log scale [F(1, 869) = 195.64, R2 = 0.18, beta coefficient = −0.02].
Average background has a clear downward trend when viewed for all arrays over time (Fig. 4A). When viewed on a log scale this trend appears stepwise (Fig. 4B). Figure 4C shows all data color coded by algorithm. Figure 4D shows that the weekly background trend for human U133 arrays is downward. Linear regression on log-scale data confirmed this trend [F(1, 869) = 128.76, R2 = 0.1291, beta coefficient = −0.001586 (log10 scale)].
Figure 5A indicates that the improvement in background follows changes in algorithm for all arrays. However, the trend for U95Av2 alone is less striking (Fig. 5B). ANOVA results indicate that the interaction of chip and algorithm is a marginally better predictor [F(38, 5167) = 108.79, R2 = 0.4463] than the two-way ANOVA of chip and algorithm [F(20, 5167) = 199.23, R2 = 0.4363]. If a three-factor model is tested with the interaction, chip, and algorithm, then algorithm is no longer a significant factor. In summary, the interaction between chip and algorithm is sufficient to explain 44% of the variability in average background.
Figure 6A displays the Bio B values for all arrays. Figure 6B shows the Bio B data from all arrays except two arrays with negative values. Figures 6C and 6D show that Bio B improves in global trends for U133A and U74Av2. ANOVA results of a full model with chip, algorithm, and the interaction explained 15% of the variability of Bio B [F(38, 5166) = 24.06, R2 = 0.1513]. All factors were significant.
Figure 7A demonstrates that percent present does not have a clear pattern over time for all arrays. Figure 7B plots the same data but color codes each algorithm, demonstrating a discontinuity in the data. An ANOVA model of chip, algorithm, and their interaction explains 38.7% of the data [F(38, 5167) = 85.43, R2 = 0.3876]. By examining only arrays analyzed by the statistical algorithm, an increasing trend is visible (Fig. 7C). This linear weekly increasing trend in arrays analyzed by the statistical method was significant [F(1, 3337) = 473.78, R2 = 0.1243, slope = 0.1393877]. The trend for U133A arrays is also significant (Fig. 7D) [F(1, 869) = 173.81, R2 = 0.1667, slope = 0.1273374].
Outliers were defined as arrays with less than 25% present calls. Figure 8A shows that the proportion of outlier arrays is decreasing over time. Figure 8B shows a decrease in the proportion of poor arrays for U95Av2 arrays with time. Although the overall proportion is lower than for other arrays, the trend is not consistent for U133A arrays. Mouse U74Av2 arrays have a variable but decreasing proportion of outliers over time. Figure 9 indicates that high throughput months do not have high outlier proportions of arrays analyzed by the statistical algorithm. In fact, no case exists where both a proportion greater than 0.2 and a monthly total of 200 arrays are observed. A linear regression of the proportion of outliers for statistically analyzed arrays against the monthly throughput has an R2 of 0.007 [F(1, 3506) = 25.59, slope = −0.0000586], indicating a less than 1% fit and a slope that is 0 to three decimal places.
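A regression like the one above (outlier proportion against monthly throughput) reduces to simple ordinary least squares. The helper below is a stdlib sketch on made-up monthly points, not the paper's data.

```python
# Simple OLS sketch for the throughput-vs-outlier-proportion check;
# the monthly points below are invented for illustration.

def ols(x, y):
    """Simple linear regression; returns (slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    ss_res = sum((b - (my + slope * (a - mx))) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, 1.0 - ss_res / ss_tot

# Hypothetical (monthly throughput, outlier proportion) pairs: a near-zero
# slope and a tiny R^2 would match the reported lack of association.
months = [(120, 0.10), (300, 0.08), (80, 0.12), (250, 0.09)]
slope, r2 = ols([m for m, _ in months], [p for _, p in months])
```

A slope indistinguishable from zero, as reported, means throughput pressure did not drive up the failure rate.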
Our analysis shows that the quality of several key metrics of Affymetrix data generated at the Hartwell Center has steadily improved since the 4th quarter of 2000. The data were derived from human and mouse arrays. A total of 208 arrays from yeast, Escherichia coli, Xenopus, and others were excluded because of the special requirements of parsing their report files and their proportionally small representation in the database. Most of the improvement in data quality is attributable to improved chip design and updated algorithms. There were notable improvements in scale factor, Bio B levels, percent present calls, and the background. The most recent arrays generally produced higher quality results and fewer extreme measurements that indicate array failure.
Interestingly, some metrics were more sensitive to changes in array design and algorithm than others. The insensitivity of 3′/5′ ratios to chip design and algorithm was expected. The ANOVA results indicate that actin 3′/5′ ratios are independent measures of sample quality. The lack of correlation with other QC metrics and the irregular pattern over time support this view. Furthermore, actin 3′/5′ ratio trends did not improve with time. For human arrays, clinical RNA samples are frequently introduced. Clinical RNA samples are prone to degradation, as sample preservation is more variable. By contrast, mouse RNA samples are collected under laboratory-controlled conditions. The actin 3′/5′ ratios of the U74Av2 mouse arrays had no clear trend and no peaks. RNA samples are produced by many laboratories studying divergent tissues and using different extraction protocols. These sources of variability are not controlled by the core facility.
Other metrics did show modest improvements over time. Bio B is a prelabeled internal standard. Its variation should be independent of in vitro transcription and labeling processes. The correlation of Bio B to other measures as well as its improving trend indicate that postlabeling processes, such as hybridization, washing, and scanning, had improved over time. This may be attributed to increased scientific skill over time as well as physical improvements in the devices and protocols. Improvements in scale factor were likewise modest. The scale factor or normalization constant is effectively a measure of brightness and is subject to both pre- and postlabeling effects. The modest improvements in scale factor and its correlation with Bio B imply that most of the improvements in scale factor are due to postlabeling causes. If prelabeling improvements were a substantial factor, then scale factor should show more improvement over time relative to Bio B.
The most sensitive QC metrics were average background and percent present calls. Background is a function of the cleanliness of the RNA, the quality of the array and each bench pre- and postlabeling protocol. The 44.6% of the variability explained by algorithm, chip design, and their interaction by ANOVA indicates that the improvements in probe selection, array manufacture, and data analysis were substantial. Judging from the actin 3′/5′ ratios we assume that RNA sample quality remained consistent over time. These data independently demonstrate that Affymetrix arrays have improved in quality over time. Furthermore, improvements in background levels for one array design analyzed with a single algorithm again may be attributed to improved skill in the core facility.
Further evidence of laboratory skill is the decrease in the proportion of outlier arrays. Figure 9 demonstrates that high throughput and high failure rates never coincide. This result can be interpreted in two ways: either higher quality enabled higher throughput, or the detection of low-quality arrays slowed production. In our view, it is more likely that the scientists who monitor failure rates reduced array throughput until performance standards were restored.
The data from Figure 8 show that much of the increase in percent present calls is due to the lowered frequency of outlier arrays. These data are consistent with the interpretation that monitoring the quality of RNA samples has prevented poor samples from being hybridized. It is worth noting that arrays that have extreme QC metrics are frequently excluded from analyses. However, thresholds for quality do vary by study, as some clinical samples are difficult to acquire or replace.
Like average background, percent present calls is a sensitive QC metric that was 38.7% explained by algorithm, chip, and their interaction according to ANOVA. This QC metric is also biologically sensitive to organism and tissue type. Different chip classes (A, B, etc.) also have characteristic percent present calls, so the effect of chip design on percent present calls was expected. Likewise, percent present calls are influenced by brightness and background measures, so it is also expected that improvements in those measures would influence the results. Improved algorithms also had a strong effect. In fact, an ANOVA of the chip and algorithm interaction alone is able to explain 38.7% of the variability in the data, indicating an inseparable and synergistic effect of these improvements. ANOVA of this interaction alone is also sufficient to explain 44.6% of the variability in average backgrounds.
The strength of the statistical interaction of the algorithm and chip type is unusual. Both single factors are expected to improve data quality. However, the strength of the interaction indicates that design of chips and the improvements in algorithms are mutually enhancing qualities. The strength of the interaction indicates that improvements that result from both factors acting together are greater than their sum. This is an expected result, as chip designs are based on the performance of signal intensity.13 The MAS 5.0 Signal algorithm down weights outliers14 and improved probe selection on new designs results in more consistent probe behavior.13 Thus, new arrays are more effectively measured with new algorithms, which are more effective on new arrays. This data set provides independent confirmation of the progress in data quality over time. Thus we expect that the most recent U133Av2 plus arrays, scanners, and the GREX software should produce a similar round of improvements. Currently, there are too few of these arrays in the database to effectively test this hypothesis.
In the future, QC metrics can be incorporated into an automated monitoring system that can alert technicians of problems before 20% failure rates are reached. Another long-term goal is standardized QC monitoring of customized cDNA arrays. These data provide a baseline for future improvements and hint at what can be analyzed if more detailed laboratory data are captured. Maintenance dates for the scanner, upgrades in equipment, changes in personnel, lot number of each array, or fluidic station all could influence quality. In the future, the databases of laboratory information management systems might record these metrics and provide further insight into data quality. Today, it is clear that chip design, algorithms, and laboratory skill are each contributing to improved array data.
I would like to acknowledge Jacques Retief of Affymetrix for assisting in the collection of data and Geoff Neale of the Hartwell Center at St. Jude for essential scientific critiques.