Recommended statistical framework
From the motivating MDD example, we showed in Figure a diagram of the statistical framework to consider potential confounding covariates, paired design and gene-specific variable selection in the meta-analysis modelling. The framework consisted of four major steps: individual study analysis, meta-analysis, pathway analysis and post hoc analysis. In the first "individual study analysis" step, collinearity of confounders was assessed and RIM_minP or FEM_minP method with variable selection was applied depending on paired or un-paired design. One or multiple meta-analysis methods were applied and compared in Step II. Pathway analysis was then performed on the detected DE gene list(s) to identify enriched pathways in Step III. Finally, post hoc analysis was performed to summarize importance of each confounding variables and to evaluate the consistency of disease effects and confounders' effects across studies. This framework is general and applicable to any multi-study weak-signal data from a complex disease similar to the motivating MDD example.
Comparison of various methods in single study analysis
Confounder adjustment and variable selection to improve DE gene detection For each single study analysis, we compared the number of detected DE genes under different p-value thresholds (p = 0.001, 0.005, 0.01 and 0.05) from different methods. In Figure , RIM_minP and RIM_BIC detected more DE genes than RIM_ALL in most studies, showing the fact that variable selection helped to ignore irrelevant clinical variables when sample size was small. Among the two variable selection methods, RIM_minP detected more genes than RIM_BIC, supporting that the focus of RIM_minP to obtain the most significant disease effect outperformed RIM_BIC's focus for best model fitting in this MDD example. Under p = 0.005, RIM_minP detected (0.8 to 1.3) times of DE genes than RIM_BIC and (0.8 to 5.5) times than RIM_ALL. The result suggested that RIM_minP is the most effective method in this data set to incorporate confounding variables in the model. In Figure , RIM_minP and RIM_BIC were also compared to paired t-test (PT) and were found to detect more DE genes, showing the advantage of incorporating confounding covariates in the model. RIM_minP identified (0.9 to 4.6) times DE genes than PT and RIM_BIC identified (0.8 to 4.4) times DE genes than PT under p = 0.005.
Paired design to improve DE gene detection To evaluate the improvement of including paired design in the model, we compared RIM_minP and FEM_minP in Figure . We observed increased DE gene detection competency of RIM_minP compared to FEM_minP in three studies (MD2_ACC, MD3_AMY and MD3_ACC) but not in the other two studies (MD1_ACC and MD1_AMY). The result showed that pairing cases to controls by age, race and sex often helped increase statistical power but not always.
Conclusion Incorporation of potential confounding covariates with variable selection and considering paired design in the model of single study analysis generally increased the detection competency of disease related biomarkers.
Comparing individual study analysis and meta-analysis
Smaller sample size in each study often results in a smaller statistical power of DE gene detection. In Table , the first five columns show the number of biomarkers detected by RIM_minP, RIM_BIC and PT under different false discovery rate (FDR) thresholds. After multiple comparison correction by Benjamini-Hochberg procedure, the first four single study analyses detected almost none DE genes under FDR = 5, 10 or 15%. This motivated us to perform meta-analysis to increase statistical power and provide validated findings on DE gene detection. In Table , the last two columns show the number of biomarkers detected by Fisher method and maxP method, respectively. The two meta-analysis methods detected more biomarkers than individual study analysis based on RIM_minP except for MD3_AMY.
To further evaluate the biological implication of the detected DE genes by various methods, pathway analysis was performed. Figure showed boxplots of the minus log-transformed p-values (base 10) from pathway analysis in the top 100 disease-related surrogate pathways. DE gene detection methods were ordered by the median of the log-transformed p-values in the plot. The squares and numbers in the upper part of the figure demonstrate the p-values from Wilcoxon signed-rank test when comparing two neighbouring DE gene detection methods (numbers show the p-values and filled squares represent that the corresponding p-value is smaller than 0.05). The result showed a clear pattern that both meta-analysis methods generally produced better DE gene detection results than the five single study analyses, no matter PT, RIM_minP or RIM_BIC was used in single study analyses. Interestingly, although MD3_AMY generated more DE genes than that produced by meta-analysis methods (Fisher and maxP) using either RIM_minP or RIM_BIC (see Table ), its pathway analysis performed worse than the two meta-analyses result and even worse than the other four single studies (Figure ). This evidence may raise concern of quality in the MD3_AMY study that will need additional investigation. RIM_BIC and RIM_minP did not appear to generate more biologically validated results than PT, probably because of the currently limited knowledge of MDD neurobiology and the still largely inaccurate pathway information.
Comparing fisher and maxP
In the literature, many microarray meta-analysis methods have been proposed and compared [
11,
34,
38]. As was discussed in the method section, different methods have different strength for detecting different types of differentially expressed genes. In Li and Tseng [
8], genes that are differentially expressed in all studies were termed as HS
A type (hypothesis setting A) while genes differentially expressed in at least one study were called HS
B type. Among the three methods compared in this paper, maxP were methods that detect HS
A type DE genes, while Fisher's method detected HS
B type DE genes. The result showed that the two meta-analysis methods detected different sets of DE genes, suggesting different algorithms and assumptions behind the methods. Additional file
1: Figure S3 shows heatmaps on genes detected by Fisher alone (Additional file
1: Figure ), maxP alone (Additional file
1: Figure S3B) or both (Additional file
1: Figure S3C). In Additional file
1: Figure S3A, majority of DE genes detected by Fisher but not by maxP were dominated by strong differential expression in one or two studies (many in MD3_AMY and some in MD2_ACC or MD3_ACC). Although Fisher's method has been popularly applied in the microarray meta-analysis literature, the result showed its weakness to be dominated by single strong signal studies that included potential false positives. For example, Fisher's method detected 810 DE genes among which 445 DE genes (about 55%) could also be detected in MD3_AMY) while only 169 (about 24%) among 683 DE genes detected by maxP method could be detected in MD3_AMY. On the other hand, maxP had better statistical power to detect many genes with weak DE evidence in all studies (Additional file
1: Figure ) that Fisher's method cannot detect. Conceptually, we were more interested in identifying genes differentially expressed across all studies through maxP.
Post hoc analysis for confounding covariates
To evaluate the impact of covariates on the gene expression values and degree of confounding with disease effect, especially among DE genes, we counted the number of appearances of covariates in the RIM_minP model for DE genes detected by maxP method under FDR = 15%. We calculated the rank of each covariate in each study and computed rank averages of each covariate to indicate relative degree of frequency that a covariate impacted gene expression and confounded with disease effect (see Table ). PMI (appeared in 13-20% RIM_minP models of 683 DE genes) and pH (appeared 12-30% in RIM_minP models) consistently had high rank, indicating that they seldom confounded and influenced the disease effect estimate. Suicide (appeared 22-50% in RIM_minP models), alcohol (appeared 29-54% in RIM_minP models) and antidepressant (appeared 17-53% RIM_models) were three factors that consistently ranked among the most influential factors. Finally, age (appeared 21-32% in RIM_minP models) was an intermediate confounding factor. The ranking of MD3_ACC and MD3_AMY was highly correlated (Spearman correlation = 0.89) and the correlation between rankings of MD1_ACC and MD1_AMY was also high (Spearman correlation = 0.54). The high within cohort correlations showed a cohort dependent structure and suggested that more studies may be needed to provide empirical evidence on the covariate impacts, particularly for the impact of antidepressant and suicide.
| Table 3Frequency of covariates appearing in RIM_minP model selection among 683 DE genes detected by maxP method under threshold FDR = 15% |
To further explore effects of covariates, we analysed a set of 9 genes that have been previously associated with MDD in the literature (see Table A and B). Intuitively, we expected that a covariate should be included in the model across studies more frequently than by random and effects of a covariate should have consistent differential expression direction across studies. We constructed two hypothesis testing using the co-appearing statistics T1 and concordant ratio statistics R described in the Method section and performed the tests on the set of 9 MDD-related genes. The result showed weak to marginal statistical significance of the first hypothesis (p = 0.397 based on RIM_minP and p = 0.011 based on RIM_BIC), suggesting covariates were consistently selected across studies by RIM_BIC but not RIM_minP. For the second hypothesis, tests for both 9 MDD gene list was statistically significant (p = 0.014 and 0.005). The effects sizes of selected confounding variables have concordant direction across studies. For example, age was found a confounding variable in gene NPY and TAC1 in three out of five studies and the effect sizes were all negative (in log scale). The moderate statistical significance is reasonable since the hypothesis tests were performed only on the nine selected genes. This result demonstrated that covariates overall impacted gene expression changes consistently and confounded with disease effects among the 9 MDD-related gene list.
Simulation results
The results of simulations to evaluate Type I error control and statistical power of different methods are shown in Table . In Scenario I simulation, the effect of disease state X on gene expression Y was confounded by two out of ten clinical variables in Z (Z = (z1,...,z10); z1 and z2 are confounders while the other eight variables are not). The result showed that t-test had low statistical power due to the confounders (power = 0.679). FEM_ALL also had low power due to the inclusion of all ten clinical variables in the model while in fact, only two of the ten were effective confounders (power = 0.697). Both FEM models with variable selection performed well. FEM_BIC performed slightly better than FEM_minP (power = 0.746 versus 0.729). The type I errors for all methods were close to the nominal 5% rate, showing adequacy of the models and statistical inference. For Scenario II, all clinical variables were independent from the gene expression. Not surprisingly, t-test performed the best with statistical power 0.938. FEM_minP and FEM_BIC both had similar high power at 0.929 and 0.925. FEM_ALL forced all variables in the model and obtained a low statistical power at 0.85. From Scenario I and Scenario II simulation, FEM_BIC and FEM_minP performed well in both extreme cases, demonstrating its sensitivity and robustness. Scenario III examined a special situation that variables in Z impacted gene expression Y through disease state X. Similar to Scenario II, t-test performed the best in this situation since Z is not confounded (power = 0.938). Both FEM_BIC and FEM_minP had similar high power (power = 0.925 and 0.916) but FEM_ALL again had low power (power = 0.851). For scenario I, we simulated two confounding variables, z1 and z2, where z1 had a strong association with Y (gene expression) while z2 had a weaker association with Y. In Table , an average of 1.97 variables was selected by RIM_minP and 1.27 variables were selected by RIM_BIC. Among them, 0.97 (by RIM_minP) and 0.78 (by RIM_BIC) variables belong to the true confounding variables (z1, z2). The result showed effectiveness of RIM_minP and RIM_BIC in variable selection compared to paired t-test (always contains no confounding variable) and RIM_ALL (always include all ten variables in Z). Overall, the simulation results confirmed our findings in MDD data analysis that modelling confounding variables with variable selection had better sensitivity and robustness in DE gene detection.
| Table 4Evaluation of t-test, FEM_minP, FEM_BIC and FEM_ALL methods by simulations |