Microarray experiments allow one to examine global patterns of gene expression, but by their nature involve multiple comparisons that can generate false positives. While the idea of removing probe sets that are unlikely to produce positive results is not new, we present a systematic analysis of the effects of several strategies. Not all genes are expressed in any one tissue [3]. Probe sets that have very low signals or are called Absent primarily reflect noise in the data, and contribute a large number of false positives without adding many true positives. Permutations expected to produce no significant changes confirmed that Absent probe sets have an increased risk of producing false positives (Tables and ). Requiring that only one treatment group meet the threshold, particularly with our recommended filtering by fraction Present, preserves data for genes that are turned on or off, which may be of great interest to biologists.
Filtering by fraction Present does a better job of removing most of the Absent probe sets while retaining most of the Present probe sets than filtering by either average MAS5 signal or RMA value (Figs. , and ), and results in much better FDR (Fig. , ). Our main evaluation criterion, improvement of FDR, was chosen because this experiment-wide measure of confidence is widely applied. Because there is no "gold standard" for real experiments, we used measures that increase the likelihood of a result being a true positive, such as p-value < 0.001 for a Welch's t-test and consistency of detecting the difference in multiple permutations. Filtering by fraction Present removes very few probe sets with p ≤ 0.001 (Fig. and , Tables and ); it does not remove probe sets that are turned on or off unless the threshold is set above 50% Present (Table ). Unlike thresholds on signal or RMA values, thresholds chosen for fraction Present are not affected by chip type, percent called present, method of scaling or normalization, or the method used to produce the expression value (e.g. MAS5, RMA).
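The filtering rule described above can be illustrated with a minimal sketch. It assumes that detection calls are available as "P", "M" or "A" per probe set and sample, that Marginal calls are not counted as Present, and that a probe set passes when the fraction Present in at least one treatment group is greater than or equal to the threshold. The function name, the argument names, and the example data are hypothetical and are not taken from the study.

```python
def fraction_present_keep(calls, groups, threshold=0.25):
    """Return one boolean per probe set; True means the probe set is kept.

    calls     -- list of rows, one per probe set; each row holds one
                 detection call per sample ("P", "M" or "A")
    groups    -- list of treatment-group labels, one per sample, in the
                 same order as the columns of `calls`
    threshold -- minimum fraction of Present calls that at least one
                 group must reach (assumption: inclusive comparison)
    """
    labels = sorted(set(groups))
    keep = []
    for row in calls:
        if len(row) != len(groups):
            raise ValueError("each row needs exactly one call per sample")
        passes = False
        for label in labels:
            group_calls = [c for c, g in zip(row, groups) if g == label]
            # Assumption: only "P" counts as Present; "M" and "A" do not.
            fraction = sum(c == "P" for c in group_calls) / len(group_calls)
            if fraction >= threshold:
                passes = True
                break
        keep.append(passes)
    return keep


# Hypothetical example: 4 probe sets, 3 samples in each of two groups.
example_groups = ["a", "a", "a", "b", "b", "b"]
example_calls = [
    ["P", "P", "P", "A", "A", "A"],  # Present in group a only (turned off in b)
    ["A", "A", "A", "A", "A", "A"],  # Absent everywhere
    ["P", "A", "A", "A", "A", "A"],  # 1 of 3 Present in group a
    ["A", "A", "A", "A", "P", "P"],  # 2 of 3 Present in group b
]
print(fraction_present_keep(example_calls, example_groups, threshold=0.25))
# -> [True, False, True, True]
```

Because the mask depends only on the detection calls, the same list of booleans could be applied to expression values produced by any method.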
Permutation of the IFN data to simulate smaller experiments also showed that the Absent probe sets generated a disproportionate number of false positives (Fig. , Table ), many of which had fold changes larger than 2, showing that a fold-change filter alone cannot fix this problem.
Filtering increased the average number of probe sets that met an FDR of 0.1 in the IFN data for experiments of all sizes (Fig. ), and was particularly helpful for the smaller experiments: more than a 3-fold improvement for the 3-sample experiments (38 to 122) and nearly double for the 4-sample experiments (378 to 672). Small experiments (3–4 samples) have limited power to detect changes, and very few probe sets can be consistently identified (Fig. ). Filtering by fraction Present greatly improves FDR even for small experiments, and retains nearly all of these reproducible probe sets (Figs. , ). While we think that experiments should use more than 3 or 4 samples, this filtering method should improve results from small experiments such as pilot projects.
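To illustrate why removing probe sets before the multiple-testing correction can increase the number that meet a fixed FDR, the following is a minimal sketch of the Benjamini and Hochberg step-up adjustment applied with and without a filter. The function names and the p-values are hypothetical; the sketch assumes one p-value per probe set and a boolean keep list produced by some filter, and it is not the analysis code used for the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_meeting_fdr(pvalues, fdr, keep=None):
    """Count probe sets whose adjusted p-value is at or below `fdr`.

    If `keep` is given, probe sets are filtered first, so the correction
    is applied to fewer tests.
    """
    if keep is not None:
        pvalues = [p for p, k in zip(pvalues, keep) if k]
    return sum(q <= fdr for q in bh_adjust(pvalues))


# Hypothetical p-values for 10 probe sets; the last 5 stand in for
# noise-level probe sets that a filter would remove.
example_p = [0.001, 0.002, 0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
example_keep = [True] * 5 + [False] * 5
print(count_meeting_fdr(example_p, fdr=0.05))                     # before filtering
print(count_meeting_fdr(example_p, fdr=0.05, keep=example_keep))  # after filtering
```

In this toy example the third probe set fails the 0.05 threshold when all ten tests are corrected together, but passes once the five filtered probe sets no longer contribute to the correction.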
Figure 9. Effect of experimental size on number of probe sets meeting a fixed value of FDR before and after filtering. The number of probe sets meeting various Benjamini and Hochberg FDR thresholds, 0.2 (blue), 0.1 (red), and 0.05 (green) before (open symbols) …
Pavlidis et al. demonstrated that 10 to 15 samples (or fewer in some cases) produced reproducible results, as determined by their stability measures of order and recovery. Our study demonstrates that even large experiments benefit from filtering by fraction Present: the IFN data with 10 samples and the smoking data with 20 both showed improvements in FDR after filtering (Fig. ), with approximately a 50% improvement in FDR when using a fraction Present of 0.25 for filtering.
Removing only those probe sets called Absent in all samples provides the single largest improvement in FDR and appears to be sufficient for large experiments. Although the FDR is somewhat better with more stringent filtering (Fig. ), the loss of probe sets at p ≤ 0.001 indicates there may be an accelerated loss of true positives in the larger data sets (Table ). As the experiment size decreases, the filtering criterion should be made more stringent (Fig. , ; Table ). For data sets with 3–4 samples, 50% Present spares most of the probe sets significant at p ≤ 0.001 and those probe sets found most consistently (Fig. , ). For more samples, relaxing the threshold to 25% fraction Present is reasonable (Fig. , , and ). Requiring 100% Present in one of the two treatment groups is not recommended, because it removes too many highly significant probe sets (Tables and ) and removes a large portion of the probe sets turned on or off (Table ) in experiments of any size.
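As a worked illustration of these guidelines, the sketch below maps an experiment size to the thresholds named above (50% for 3–4 samples, 25% for more) and converts a threshold into the smallest number of Present calls a group must contain. It assumes that "samples" means samples per treatment group and that the comparison is inclusive; both are assumptions, and the function names are hypothetical.

```python
from math import ceil


def suggested_threshold(samples_per_group):
    """Fraction Present threshold following the guideline in the text.

    Assumption: experiment size is counted as samples per treatment group.
    """
    if samples_per_group < 1:
        raise ValueError("samples_per_group must be at least 1")
    return 0.50 if samples_per_group <= 4 else 0.25


def min_present_calls(samples_per_group, threshold):
    """Smallest number of Present calls in a group that reaches `threshold`."""
    return ceil(threshold * samples_per_group)


for n in (3, 4, 10, 20):
    t = suggested_threshold(n)
    print(n, t, min_present_calls(n, t))
```

Under these assumptions, a group of 3 samples needs 2 Present calls at the 50% threshold, and a group of 10 samples needs 3 Present calls at the 25% threshold.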
The results are similar for datasets that differ greatly. The IFN data, presented in the most detail, were from an experiment that examined the effects of interferon treatment on human PBMC in vitro, and used the HGU133A GeneChip®. The vitamin A data compared RNA extracted from liver tissue from Sprague-Dawley rats fed vitamin A deficient or sufficient diets [15], and used the relatively old RGU34A GeneChip®, designed with much less sequence data and informatics. The smoking data are from a large study examining differences in bronchial epithelia extracted from human subjects [22], and also used the HGU133A GeneChip®; the variability within each group in the smoking dataset is much greater than in the others. The Pearson correlations between samples from subjects within each of the two groups in the smoking data were 0.87 and 0.89, compared to an average Pearson correlation >0.97 for samples within each group for the IFN data. In all three cases, representing different generations of GeneChip®, different species, different laboratories and different amounts of intra-group variability, our approach achieved the primary goal of improving FDR while minimizing the removal of very significant probe sets (p ≤ 0.001) and retaining those probe sets turned on or off.
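The intra-group variability compared above can be summarized in several ways; one plausible reading, sketched here, is the mean Pearson correlation over all pairs of samples within a group. This is an assumption about how such a summary might be computed, not a description of the procedure used for the values quoted in the text, and the function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_within_group_correlation(samples):
    """Average Pearson correlation over all pairs of samples in one group.

    samples -- list of expression profiles, one list of values per sample,
               with probe sets in the same order in every profile
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical group of three samples measured on four probe sets.
example_group = [
    [7.1, 9.4, 5.2, 11.0],
    [7.3, 9.1, 5.6, 10.8],
    [6.8, 9.9, 5.0, 11.3],
]
print(round(mean_within_group_correlation(example_group), 3))
```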
Filtering based on the fraction of Present calls is superior to methods based on signal or RMA value because it is more likely to preserve probe sets turned on or off and it removes probe sets that show cross-hybridization. Filtering by fraction Present is also much easier to implement, because general guidelines can be set based upon the experiment size instead of having to examine the distribution of signal values; the variability of the signal distributions for different datasets is such that no average signal value gives comparable results across all datasets (Table ). Although the detection call is generated by MAS5, this method can be used as a pre-filter to improve results using non-MAS5 generated data, such as RMA.