We investigated the degree to which correlation within gene sets is preserved across multiple experiments by analyzing Gene Expression Omnibus [
23] data from two human (HG-U133A, HG-U95A), and two mouse (MG-U74A, MOE430A) arrays. Datasets with 20 or more samples were used, resulting in 202 datasets and 8,656 arrays. For each dataset we calculated the mean correlation for genes within each GO category or KEGG pathway. The results give strong evidence of the reproducibility of internal correlations (low standard errors in Figure ), with strong agreement of gene set correlations between platforms (Spearman cor.: 0.719 - 0.862) (Figure ). Correlation within gene sets is thus closely associated with gene set membership, regardless of experimental condition [
9]. The samples represent a wide variety of experiments, including
in vitro cell lines, sex comparisons, multiple cancer sub-types, tissues, and diseases. Clearly, in contrast to previous assertions [
5,
22], mere correlation of genes within gene sets and pathways is commonplace, and does not necessarily reflect unique features of the experiment at hand. As Figure shows, correlation within gene sets covers a wide range, and is generally higher than correlation of randomly selected genes. The correlations within gene sets are much higher than the correlation of all transcripts on each platform with each other (Figure , red). Of course, genes from different sets can be correlated (even highly so), but we confine our focus here to pre-defined gene sets.
One simple way to describe the effect of correlation is through a standardized enrichment test statistic. Let z be the signed square root of the independence assumption χ2 statistic (see Methods). Under the null hypothesis (no enrichment), z should have approximately mean 0 and variance 1. However, the true variance can be much greater, with an increase proportional to (m - 1)ρ (see Methods), where m is the number of genes in the set, and ρ is the average correlation among genes in the set (excluding self-correlations). Independence assumption methods assume ρ = 0, implying that the variance equals 1. However, if the average internal correlation of a gene set is positive, then the variance of z can be greatly inflated, even if this correlation is modest, depending on the gene set size m.
To empirically demonstrate that variance inflation results in the false significance of gene sets, we randomly permuted the sample labels associated with each of the 202 Gene Expression Omnibus data sets 10,000 times, and then performed a gene set analysis on each permutation using a common independence assumption method. For each permutation a list of "significant" genes with
p ≤ 0.05 (using a t-test or F-test, as appropriate) was created in order to have a meaningfully-sized gene list, and the standard χ
2 statistic was used to assess the significance of each gene set. A gene set was called significant if it had a Bonferroni-corrected (for the number of sets) χ
2 p-value ≤ 0.05. Note that this is a highly conservative threshold, intended to guarantee that no more than 5% of the permutations result in one or more gene sets (falsely) called significant. The gene set null hypothesis is that each gene set has the same pattern and proportion of differentially expressed genes as any other set [
24]. Permutation of the sample labels induces the null hypothesis for each gene, and thus no gene sets should exhibit enrichment for differential expression. Hence, any gene set declared significant is a false positive.
To determine the overall experiment-wise increase in false positives, we counted the number of permutations in which at least one gene set was declared significant, using the Bonferroni-corrected 0.05 threshold. If the independence assumptions were true, no more than 5% of the permutations would give rise to significant gene sets. However, the observed proportion is much higher (Figure ; Additional file
1, Figure S1). False positive rates often exceed 50%, and in some cases are above 80%, and are even higher when the False Discovery Rate (FDR) is used as a multiple comparison correction. Clearly such high false positive rates have serious implications for the conclusions drawn by any study which uses independence assumption methods to provide biological interpretation of microarray data.
In addition, as predicted, we found that variance inflation is directly related to the false-positive rate for a gene set. For each GO Biological Process (Figure ) or KEGG pathway (Additional file
1, Figure S2), we show that the empirical variances of the standardized enrichment statistic (which often greatly exceed 1) are highly correlated with the per-category false positive rates. Specifically, for each gene set, we calculated the variance of the square root of the χ
2 statistic across all permutations, and plotted it against the number of permutations in which that gene set was called "significant" using a Bonferroni correction.
Fortunately, the shortcomings of independence assumption methods can be addressed through the use of resampling-based methods. These methods use the resampling data to construct an empirical null distribution for the gene set test statistics, taking into account the correlation structure between genes and providing a more accurate assessment of statistical significance. Several existing tools use resampling based approaches [
2,
16-
19], either permuting sample labels relative to gene expression, or performing bootstrap resampling [
24]. We empirically investigated the performance of the permutation approach, which we will refer to as "resampling" to avoid confusion with the permutations described above. The sample labels for each data set were resampled without replacement 10,000 times, and the resulting enrichment statistic distribution was used to obtain an empirical p-value for each gene set. For example, if the observed enrichment statistic for a gene set was greater than or equal to 99% of the resampled enrichment statistics, a p-value of 0.01 was assigned to that gene set. As the resampling units are entire arrays, correlation between genes is maintained under resampling. Therefore the effects of correlation within gene sets apply to both the resampled and observed data, properly controlling the false positive rate. To demonstrate that the resampling approach correctly controls the false positive rate, we applied the resampling procedure to all permutations of all 202 datasets, using the same χ
2 statistic as above. Using the empirical distribution of test statistics under resampling to generate the gene set p-value, proper control of the false positive rate was obtained (Figure ; Additional file
1, Figure S1, right-most boxplots).
While the analyses above clearly show that gene correlation increases the false positive rate for independence assumption methods, it may be tempting to argue that correlation does not affect false positive rates in real datasets, which presumably include true enrichment. To investigate this, we computed the proportion of observed datasets in which each gene set was declared significant by the independence assumption method, and compared these values to the previously generated proportions observed under random permutation. The results for GO Biological Process categories (Figure ) and KEGG pathways (Additional file
1, Figure S3) demonstrate a strong relationship. In other words, categories declared statistically significant by independence assumption methods in actual datasets are often declared significant when treatment assignments are made purely at random. This strongly calls into question the biological conclusions drawn from such analyses, as we have demonstrated that such correlation is a predominant source of misinterpretation, rather than a biological confirmation. Indeed, some gene sets have such high false positive rates (Figures and ) under independence assumption that one or more sets may be found significant in multiple studies, further enhancing apparent biological confirmation.