|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Recently, many univariate and several multivariate approaches have been suggested for testing differential expression of gene sets between different phenotypes. However, despite a wealth of literature studying their performance on simulated and real biological data, still there is a need to quantify their relative performance when they are testing different null hypotheses.
Results: In this article, we compare the performance of univariate and multivariate tests on both simulated and biological data. In the simulation study we demonstrate that high correlations equally affect the power of both, univariate as well as multivariate tests. In addition, for most of them the power is similarly affected by the dimensionality of the gene set and by the percentage of genes in the set, for which expression is changing between two phenotypes. The application of different test statistics to biological data reveals that three statistics (sum of squared t-tests, Hotelling's T2, N-statistic), testing different null hypotheses, find some common but also some complementing differentially expressed gene sets under specific settings. This demonstrates that due to complementing null hypotheses each test projects on different aspects of the data and for the analysis of biological data it is beneficial to use all three tests simultaneously instead of focusing exclusively on just one.
Supplementary information: Supplementary data are available at Bioinformatics online.
In recent years, there has been a considerable shift in attention from single components of molecular biological systems towards studies focusing on functionally related compartments. The reason for this change can be acknowledged, at least partly, to the fact that today's systems perspective (Kitano, 2001; Palsson, 2006) is generally considered as very beneficial to elucidate the collective functioning of biological processes and even of whole cells. In this context, it is no surprise that this reflects also in recent developments regarding the analysis of gene expression data (Emmert-Streib and Dehmer, 2008) and more specifically endeavors to detect differentially expressed pathways (Barry et al., 2008; Emmert-Streib, 2007; Mootha et al., 2003). The analysis of pathways (gene ontologies, or pre-selected gene sets) that are significantly differentially expressed between two phenotypes is intuitively appealing and there are two known reasons for this. First, by arranging genes into pathways the dimensionality of the data is reduced and as a consequence the number of statistical hypotheses to test. Second, the statement ‘a gene is differentially expressed between two phenotypes’ has less explanatory power compared to the statement ‘a pathway is differentially expressed between two phenotypes’. However, the idea to look for differentially expressed pathways (gene sets in what follows) appeared with a different reasoning in mind. There is a general belief that in metabolic diseases changes in gene expression are moderate and cannot be detected for individual genes. For example, after correction for multiple tests there were no differentially expressed genes between type II diabetes positive and negative patients (Mootha et al., 2003). In contrast, the search for differentially expressed gene sets identified a set of genes involved in oxidative phosphorylation as coordinately decreased in human diabetic muscle (Mootha et al., 2003). In the latter work, Mootha and colleagues described the first algorithm (‘Gene Set Enrichment Analysis’, GSEA) focused on the expression changes in a set of genes as opposed to changes in the expression of individual genes. Since that time many approaches for the analysis of gene sets have been suggested (Kim and Volsky, 2005; Nettleton et al., 2008; Tomfohr et al., 2005) and their number is still growing (see Ackermann and Strimmer, 2009 for a review). The major difference between them was formulated by Goeman and Buhlmann (2007) in terms of the scope of the comparisons of these approaches. Competitive tests compare a gene set against the rest of all sets and self-contained tests answer the question whether two gene sets are differentially expressed between different phenotypes. In what follows we concentrate on the self-contained tests only [see Goeman and Buhlmann (2007) for further discussion]. Self-contained tests, in turn, are different in terms of whether they are multivariate and account for interdependencies among genes (e.g. Hotelling's T2 test: Kong et al., 2006; Lu et al., 2005; Xiong, 2006; GlobalANCOVA: Hummel et al., 2008; N–statistic: Klebanov et al., 2007), or disregard existing complex correlation structure in a gene set and consider gene-level statistics only (e.g. weighted sum of t-tests: Tian et al., 2005; median-based or sign-tests: Jiang and Gentleman, 2007). Furthermore, for gene-level statistics a transformation of the test statistics is frequently used, to account for the presence of up- and down-regulated genes in a gene set (Ackermann and Strimmer, 2009). More importantly, for univariate and multivariate self-contained tests the underlying statistical hypotheses are different. For example, Hotelling's T2 tests the equality of two multivariate mean vectors while N-statistic tests the equality of two multivariate distributions. A combination of univariate statistics (either transformed or not) studies whether the aggregate gene-level test score differentiates between two phenotypes (Jiang and Gentleman, 2007). We want to emphasize that due to these complementing null hypotheses each test projects on different aspects of the data.
To get the most out of the many tests available one needs to know their relative power in different settings and account for different hypotheses they test. In this article, we compare the performance of univariate and multivariate tests on simulated and biological data with three questions in mind. First, not all genes in a gene set are expected to change their expressions between different phenotypes. The percent of genes changing their expression in a gene set, in the way that the entire gene set is called differentially expressed (‘detection call’), is an important but currently unknown characteristic of a test performance. Second, genes in a gene set are functionally related and have complex correlation structure. Multivariate tests might have better power because they account for interdependences among genes considering the joint distribution of gene expression levels, in contrast to univariate tests, which test differences in the marginal distributions, but this hypothesis requires confirmation. The third question is an implication of the second: one might expect that because univariate and multivariate statistics test different null hypotheses that for real biological data they may result in completely different gene sets. There is a reason for concern here, because for example the application of Principal Component Analysis and gene-level tests resulted in a similar scenario (Jiang and Gentleman, 2007). In this article we answer the first two questions on simulated data, mimicking the stated conditions and study in detail the third one on two biological data sets.
In our analysis we compare five statistical tests, four tests representing popular choices in testing whether two gene sets are differentially expressed between different phenotypes (though testing different statistical hypotheses) and one test which has never been used in the context of gene expression sets before. We include two multivariate tests, Hotelling's T2 test (Lu et al., 2005), and N-statistic (Klebanov et al., 2007), also known as non-parametric Cramer test (Baringhaus and Franz, 2004) and two univariate tests (different transformations of the t-statistic, see below). A fifth test, the multivariate Dempster's T1 (Dempster, 1958) was recently considered by Srivastava and Du (2008) in the context of power study (see Srivastava and Du, 2008 for further details). For biological data we compare the performance of these tests with other popular approaches (Jiang and Gentleman, 2007; Liu et al., 2007).
Consider a pair of biological conditions, such as ‘case’ versus ‘control’ or ‘treated’ versus ‘untreated’. Suppose there are n1 samples of measurements of p genes for the first, and n2 samples of measurement of p genes for the second conditions. Let the two p-dimensional random vectors of measurements X1,…, Xn1 and Y1,…, Yn2 be independent and identically distributed with the distribution functions F, G, mean vectors and p × p covariance matrices Sx, Sy. We consider the problem of testing the hypothesis H: F = G against a fixed alternative F ≠ G.
Hotelling's T2 does not test the hypothesis H: F = G. If F and G are multivariate normal distributions with common covariance matrix, the Hotelling's T2 tests the simplified hypothesis . The corresponding test statistic is
where S is the pooled variance–covariance matrix of the measurements. The problem, however, is that if p is larger than n1 + n2 − 2 the covariance matrix becomes singular and cannot be inverted. In gene expression studies the number of samples is always much less than the number of observation and to calculate the inverse one needs to use additional steps. One obvious modification is to use a generalized matrix inverse instead of S−1, and another one is the dimensionality reduction (e.g. Kong et al., 2006). Yet another possibility is to use the shrinkage estimator by Schafer and Strimmer (2005). We did not find significant differences between results obtained using the generalized matrix inverse (as implemented in Venables and Ripley, 1999) and the shrinkage estimator; in what follows the generalized inverse is used.
Here we consider only L(X, Y) = ‖X − Y‖, the Euclidean distance in Rp. In the original papers several other kernel functions L were suggested as well (see Baringhaus and Franz, 2004; Klebanov et al., 2007).
Tian et al. (2005) suggested averaging the values of t-statistic for individual genes and testing the hypothesis of no association between gene sets and phenotypes with label permutations. This approach again tests the simplified hypothesis of no differences in mean expressions between two phenotypes. Here we consider two sign independent statistics: average absolute values of t–statistic and average squared t-statistic
because the possibility that the same gene set has to include both, highly up- and down-regulated genes cannot be excluded.
We simulated two samples of equal size, N/2 from the p-dimensional normal distribution N(0, Σ) and N(μ, Σ) representing two biological conditions with different outcomes under the following global and local settings.
For every fixed local setting (A or B) the samples were simulated under all varieties of global settings, giving in sum 3 × 3 × 3 × 2 = 54 different simulated data sets.
In order to assess how good the five tests under investigation control the Type I error rate we estimate it numerically from 1000 replications of the data set. We estimate the Type I error by the observed proportion of 1000 replications of the data set, simulated under the null hypothesis, where the alternative hypothesis was falsely accepted. We also estimate the empirical power by the observed proportion of 1000 replications of the data set, simulated under the alternative hypothesis, where the null hypothesis was correctly rejected. For all of these simulations we assumed a significance level α = 0.05.
To study the performance of different statistics on real biological data we consider two biological examples, frequently used in the literature with regards to differentially expressed gene sets.
The first example comprises 50 samples of NCI-60 cell lines differentiated based on the status of the p53 gene: 17 cell lines carrying normal p53 gene and 33 cell lines classified as currying mutated p53 (Subramanian et al., 2005). This data set was also analyzed by Liu et al. (2007), to compare the performance of three methods, Global Test, ANCOVA Global Test and SAM-GS (see Liu et al., 2007 for detail). Expression data and pathways (C2 functional gene sets, as defined in Subramanian et al., 2005) were downloaded from GSEA web site (‘Molecular Signature Database’). We excluded pathways with <15 genes, and studied differential expression of 369 sets. Data were normalized using Variance Stabilization (Huber et al., 2002) as recommended (Liu et al., 2007).
The second example is a large data set from a clinical trial of ALL. Similar to Jiang and Gentleman (2007) we considered only two groups of patients with ALL: those having BCR/ABL fusion gene (37 cases) and those tested negative for this fusion (42 cases). Data were preprocessed as described (Jiang and Gentleman, 2007). As gene sets we considered KEGG (Kanehisa and Goto, 2000) pathways with more than 10 members.
For all statistics P-values were obtained from permutations (1000). For the sake of comparison to the results of Liu et al. (2007) and Jiang and Gentleman (2007), we fixed the same threshold for all P-values (0.001), followed as much as possible to their data preprocessing steps and did not consider correction for multiple testing. The later helps to avoid selection of a specific multiple testing procedure, influencing the results significantly (Dudoit and van der Laan, 2008). In these settings we present all pathways, found differentially expressed by at least one out of four tests (due to the similarity of Dempster's and Σt,2 in the simulation studies we dropped the former test for biological data. This similarity is actually expected, because Dempster's test is not truly multivariate in a sense it does not account for a comlex correlation structure).
Table 1 presents the results of our simulations to estimate the attained significance levels of the five tests. As can be seen all tests provide rather good estimates of α = 0.05 when μ = 0 under different parameter settings. It should be noted that Hotelling's T2 always provides slightly conservative estimates of Type I errors, while all other tests are more liberal.
Figures 1–3 show the power curves of multivariate and univariate tests under the local setting A. Generally, among all factors (namely dimensionality, detection call gamma and pairwise correlations), the correlations impact the power of the tests in the most effective way.
When the pairwise correlation is set to 0.1 (Fig. 1) the power of two univariate tests namely Σt,1, Σt,2 and two multivariate tests, namely N-statistic, T1, is virtually the same (the power of Σt,1 is lower when the detection call is equal to 0.25; Fig. 1, left-most column). The pathway's dimensionality slightly influences the slope of the power curve. When the detection call is set to 0.5 all four statistics reach ~100% power when the mean expressions are 0.5 different given the pathway size is 100, 90% power given the pathway size is 60 and 80% power given the pathway size is 20 (Fig. 1, middle column). The detection call influences the power in a similar manner. When the pathway size is set to 60, the power of all four statistics reaches ~100% power when the mean expressions are 0.5 different given the detection call is 0.75, 90% power given the detection call is 0.5 and 70% power given the detection call is 0.25 (Fig. 1, middle row).
When pairwise correlations are small (0.1) in all settings Hotelling's T2 has lower power in comparison to other tests. The dimensionality of the pathway influences the power of T2 in the similar way as the power of all other statistics. In contrast, the detection call influences the power of T2 to a smaller extent: to reject the null hypothesis T2 needs only several components of the mean vector to be significantly different (see below).
When the pairwise correlations are set to 0.5 or 0.9 (Figs 2 and and3)3) the power of all statistics is generally lower, but the power of Σt,1 is influenced the most when the detection call is set to 0.25. For the pathway size of 100 and a detection call of 0.75, the power curves for other four statistics are nearly identical (Figs 2 and and3,3, upper rows). However, under the presence of correlations even the best-performing N-statistic reaches the power of 100% only when the mean expressions are 1.5 different.
An interesting observation is that there is a narrow area of parameter values where Hotelling's T2 is the best statistics (Figs 2 and and3,3, top left). For higher correlations (>0.1) low gamma (<0.5) and large gene sets (>60) there are intervals of mean differences for which Hotelling's T2 slightly outperforms N-statistic. The difference in power is small but we will see in the results section for the biological data that this effect is relevant. Aside from these special parameter settings Hotelling's T2 gives almost always the lowest power compared to all other tests.
Figure 4 shows the power curves of multivariate and univariate tests under the local setting B. Only N-statistic has the power to test the full hypothesis F = G against a fixed alternative F ≠ G (Fig. 4). All other statistics have no power at all. This is expected, because the other tests were designed for testing different hypothesis. For a pathway, changes in the variance of the pathway given the same mean vector might indicate the presence of differential regulation under different phenotypes, and should be as interesting as changes in the average expression itself. It appears that this case can be detected by the N-statistic while all other test statistics are insensitive.
In sum, different statistics detected 19 gene sets at the given significance level (P <0.001), differentially expressed between cancer cell lines with and without p53 mutation (Table 2). Among them, 13 were also reported in Liu et al. (2007) study and six were found in this study for the first time. We note that the sum of squared t-statistics has the highest power [13 significantly differentially expressed pathways, Table 2, Fig. 5a] and the result of Σt,1 is a subset of the result of Σt,2.
While the results of N-statistic and two univariate tests almost coincide at the more liberal significance level (e.g. P < 0.01), for several pathways P-values from Hotelling's T2 are rather high. On the other hand, four gene sets (Cell_Cycle, cell_cycle_checkpointII, DNA_damage_signalling and CR_CELL_CYCLE) were reported exclusively by Hotelling's T2 and cannot be detected by other statistics even at the 0.01 liberal threshold (except DNA_damage_signalling, detected at 0.01 threshold by Σt,2). These sets represent the major of the functional targets of p53 activity, namely the regulation of cell cycle progression and DNA damage signaling, and include p53 itself and its multiple targets. Therefore, one may consider the failure of other statistics to identify correctly these sets as a false negative error. In what follows we provide plausible, yet empirical explanation of the reasons behind sporadically high P-values of Hotelling's T2 and unexpected false negative errors of other statistics.
First, we note that generally when P-values from Hotelling's T2 are high, the distributions of pairwise correlation coefficients in two groups are rather different. For example, for bcl2family_and_reg_network the P-value of Hotelling's T2 is the largest: 0.807 (Table 2). For this set the distributions of pairwise correlations in two groups are drastically different and average pairwise correlations are 0.027 and −0.004, respectively (Supplementary Fig. 1). For two other gene sets with the largest P-values from Hotelling's T2 (radiation_sensitivity and p53hypoxiaPathway, P-values are 0.382 and 0.247, respectively) the situation is similar (Supplementary Fig. 1). It might be that in these cases the assumption of equal covariance for two samples is violated and the power of the test is dropped.
The presence of false negative errors for the other statistics can be explained if we introduce the measure of the percentage of individually changing genes in a gene set. Let us measure this quantity using absolute values of the t-statistic and consider that the gene is changing if its t-statistic is more than 1.96 (the factual value of the threshold here does not matter; it should simply reflect the presence of change). For the gene sets, detected by Hotelling's T2 only (Cell_Cycle, cell_cycle_checkpointII, DNA_damage_signalling and CR_CELL_CYCLE) this measure is much lower (12.3, 12.5, 18.0 and 11.3%, respectively), as compared to other gene sets (Table 2 and Supplementary Fig. S1). Thus the other statistics cannot detect the overall expression changes in a set when only few genes are actually changing, in contrast with the high sensitivity of Hotelling's T2.
There are two pathways, reported by univariate tests only, even if we consider the more liberal significance level (e.g. P < 0.01): radiation_sensitivity and p53hypoxiaPathway. These pathways were also reported by Liu et al. (2007), include p53 and its targets, and should be considered as true positives.
Overall for the ALL data set the four tests gave more homogeneous results compared to the p53 data set, probably because of the larger ALL sample size (79 slides) of the ALL data set. Thirty-five KEGG pathways were found differentially expressed by at least one of the four tests at the given significance level (P ≤0.001) (Supplementary Table 1). Multivariate and univariate tests detected simultaneously eight pathways (Fig. 5b). In four cases, P-values of Σt,1 and in four cases Hotelling's T2 were larger than 0.05 (Supplementary Table 1). As before, the failure of Σt,1 to detect differential expression among sets can be explained by the small percent of individual genes in a set actually changing their expression (Supplementary Table 1 and Supplementary Fig. S2). The failure of Hotelling's T2 test might be related to the violated assumption of equal covariance for two samples (Supplementary Fig. S2). Among 35 sets, 13 were also reported by Jiang and Gentleman (Jiang and Gentleman, 2007) as differentially expressed.
The analysis of differentially expressed gene sets is an effective way to overview the underlying biological trends in gene expression data sets. There is a rich body of statistical tests available for finding gene sets, differentially expressed between two phenotypes. Several comparative studies address the relative performance of these tests (Jiang and Gentleman, 2007; Liu et al., 2007; Song and Black, 2008), one suggests a special treatment of gene sets analyses for prokaryotes (Tintle et al., 2008) and one builds a complete ‘taxonomy’ of the testing procedures (Ackermann and Strimmer, 2009). Here we emphasize that the most fundamental difference among these approaches is formulated in terms of the null hypothesis they test.
We have studied the relative power of popular univariate and multivariate tests for several simulated and two biological data sets. We considered two gene-level statistics, the average of the absolute and the squared t-tests for individual genes in a gene set, and three multivariate statistics, Hotelling's T2, N-statistic and Dempster T1. Although for three of these statistics the tested null hypothesis is different, their relative performance on simulated data is similar. All tests perform reasonably well in estimating the Type I error rate. Among the three parameters varied in simulations (the magnitude of pairwise correlations among gene expressions, the number of genes changing their expression in a set and the size of a gene set) the magnitude of pairwise correlations has the largest influence on the power of all tests. Despite the general belief that multivariate tests, accounting for a complex interdependence structure between genes might have a better power compared to univariate tests this study demonstrates that the use of multivariate statistics does not lead to a substantial gain in power when correlations are present and high. When correlations are low both, the number of genes changing their expression in a set and the size of a set have only slight influence on the power (except Hotelling's T2), in contrast to the case of high correlations. For the poor performance of Hotelling's T2 there are two possible explanations in our opinion. First, the numerical estimation of the generalized inverse may be unstable for conditions relevant for the analysis of microarray data. Second, this may depend directly or indirectly on an altered correlation structure in both conditions because they form the underlying basis for the generalized inverse of the combined covariance matrix. For both reasons additional studies are necessary. On the other hand, the power of Hotelling's T2 rises quickly when the number of genes changing in a set and the set's size increase when correlations are low. In sum, the performance of all tests coincides when the correlations are low, the gene set size is large and the percent of genes changing their expression is high. However, the beneficial combinations of all these factors may rarely happen in true biological data and the performance of these tests might be different for real data set.
The analysis of biological data again demonstrates that there are some aspects in gene expression data that cannot be efficiently modeled. From the Venn diagrams shown in Figure 5a and b one can see that Hotelling's T2 is a more important test for biological data than one might expect from the simulation results in Section 3. This is not only a surprise but strongly indicates that the simulation technique lacks important characteristics from real biological data. Also, for the p53 data set, Σt2 has slightly higher power than N-statistic, while N-statistic always slightly outperforms all other tests on simulated data. On the other hand, for simulated as well as biological data, the results of Σt1 are subsets of the results of Σt2. Similarly, the good performance of Hotelling's T2 when the number of genes changing their expression in a gene set is small is captured in both, simulated and biological data. That is, the simplified model assuming a multivariate normal distribution for gene expressions adequately reflects some, but not all properties of the biological data.
The intersection of significant gene sets, found by different tests in p53 and ALL data is substantial. However, Hotelling's T2 is able to find gene sets which are not discovered by other tests, where only about 11–12% of all genes are changing their expression. This is because Hotelling's T2 involves all variables symmetrically and can equally detect changes for a single variable, for all, or for the subset. One can expect that the sets reported exclusively by Hotelling's T2 constitute false negative errors for other tests in the case of p53 data, because these sets directly include p53 together with its functional targets. We also note that for the p53 and ALL data sets, on the average only 20–25% of genes in the gene sets are actually changing their expression (Table 2, Supplementary Table 1). This observation adds more evidence to the motivation of (Mootha et al., 2003) for studying gene sets instead of individual genes.
Here we studied two univariate and three multivariate self-contained tests for detecting differential expression of gene sets. All these tests can be distinguished by their underlying null hypotheses, leaving only three conceptually different statistical tests with respect to the null hypotheses. It should be noted that these three null hypotheses cover the vast majority of the current universe of self-contained tests. The three best-performing tests for these hypotheses (sum of squared t-tests, Hotelling's T2 and N-statistic), find common but also complementing gene sets differentially expressed. Due to complementing null hypotheses, each test projects on different aspects of the data and for this reason, their simultaneous use for the analysis of biological data leads to an increased power as compared to individual tests.
The authors like to thank Earl Glynn for his comments on the manuscript.
Funding: Grants from NIH (R21HG004648, GM079259) and an Alfred P. Sloan Research Fellowship (to G.G.).
Conflict of Interest: none declared.