Identifying predefined gene sets enriched with DEGs has become a routine part of microarray analysis, and provides investigators with greater biological insight than significant gene lists alone. Our aim was to develop a method that (i) does not require the choice of a significance cutoff, (ii) allows the investigator to choose different methods for detecting DEGs, (iii) provides unbiased assessment of statistical significance and (iv) similar to FE test has an intuitive interpretation in terms of odds ratios. The method we developed, LRpath, uses logistic regression to model the relationship between gene set membership and differential expression in terms of odds ratios of enrichment. The basic question addressed by LRpath is whether the odds of a gene belonging to a predefined gene set increases as the significance of differential expression increases. Unlike the χ2-type of methods, our model allows the data resulting from tests of differential expression to remain on a continuous scale. This removes the need to choose a significance cutoff, and has the advantage of taking into account the distribution of significance levels for genes not belonging to, as well as belonging to, the gene set of interest. If expression of genes from a specific biological pathway is affected in the experiment, we would expect that genes with significant P-values are more likely to be members of this pathway than genes with less significant P-values, although we may not know exactly where, or want, to draw a line between ‘significant’ and ‘non-significant’ differential expressions.
Led by the communication from Michael Newton about underlying similarities between the logistic regression and the random-sets framework, we further examined the relationship between the two methods. Our results indicate that the differences between the two procedures, when using the same score are very small. Actually, it can be shown that the random-sets method of allez is nearly identical (see Supplementary Material
) to performing logistic regression and using the score test for significance, which is asymptotically equivalent to the statistical test used by LRpath. Thus, small observed differences could be due to small differences in the performance of the score and the Wald tests in this context of logistic regression. Regardless of which of the two procedures are used, one seems to be better off using −log(P
-values) instead of using the z
We performed an in-depth comparison with other relevant methods using both simulated and experimental data. In the simulation study, we know the exact truth about enriched GO terms, but the simulated data lacks the natural correlation structure found in experimental data and may have unrealistic distributional properties. On the other hand, the exact truth is unknown in the breast cancer microarray data we analyzed, but the other issues are appropriately addressed. This comparison of independent experimental datasets is both inherently free of bias, and addresses the question of which methods provide the most reproducible results. The observed concordance of the results in these different analyses offers strong evidence that our conclusions are valid and reproducible. Results of our simulation study indicate that, as expected, the power to detect enriched GO terms depends greatly on the level of enrichment, and to a lesser extent on several other parameters tested. For FE test, we conclude that both the significance cutoff used to define DEGs and the test used to detect DEGs (data not shown) affect the results of gene enrichment testing. Overall, LRpath performed better than the other methods tested based on all criteria.
Using the concordance between the two independent larger sample breast cancer datasets, we showed that LRpath again resulted in the best performance. The results from these analyses were generally in agreement with results of the simulation study. In both cases, allez and ProbCD performed favorably, and FE with a fairly relaxed cutoff also performed well. ProbCD may offer an additional advantage in situations when the gene set assignments are probabilistic. Newton et al. showed that selection methods (e.g. FE test) and averaging methods (e.g. allez) each have a ‘domain of superiority’ in the space of possible enrichment problems. In general, averaging methods are superior when the differential expression effects are relatively small and are most advantageous when the enrichment level is also high. Based on our results for the breast cancer experiments, it seem likely that these criteria held true, and that the disadvantage of the averaging method can be at least partially offset by using an input measure, such as −log(P-value), that allows a smaller number of highly DEGs to help drive the enrichment process. Further comparisons will be necessary to assess the performance of other gene score measurements as input, such as log-fold change or t-statistics, and to what extent their performances are dataset dependent.
Using the breast cancer (Miller et al
; Sotiriou et al
) and IPF (Pardo et al
) datasets, we also uncovered novel insights into the biological mechanisms of these diseases. In breast cancer, we demonstrate the use of LRpath and other methods to detect consistent GO terms distinguishing histologic grade 3 and grade 1 primary breast tumor samples from two independent datasets. In IPF, we demonstrate the use of LRpath to detect over-represented biological categories not presented in the original analysis and which would not have been identified by FE test (i.e. identifying additional pathways including altered ‘Cell cycle’, decreased ‘Blood vessel development’ and a decrease in ‘Cytokine−cytokine receptor interaction’.)
We have implemented LRpath as an R function (Ihaka et al
, and Supplementary Material
) which can be downloaded along with all other Supplementary Material
from our supporting website http://eh3.uc.edu/lrpath
. The function is designed to automatically test the categories of GO terms or KEGG Pathways, but can be modified for use with user-defined categories. Current implementation accepts as input significance statistics of the investigator's choice and allows for duplicate and missing gene identifiers.