|Home | About | Journals | Submit | Contact Us | Français|
To estimate the multivariate regression model from multiple individual studies, it would be challenging to obtain results if the input from individual studies only provide univariate or incomplete multivariate regression information. Samsa et al. (J. Biomed. Biotechnol. 2005; 2:113–123) proposed a simple method to combine coefficients from univariate linear regression models into a multivariate linear regression model, a method known as synthesis analysis. However, the validity of this method relies on the normality assumption of the data, and it does not provide variance estimates. In this paper we propose a new synthesis method that improves on the existing synthesis method by eliminating the normality assumption, reducing bias, and allowing for the variance estimation of the estimated parameters.
Meta-analysis is a statistical technique for amalgamating, summarizing, and reviewing previous quantitative research. A typical meta-analysis is to summarize all the research results on one topic and to discuss reliability of this summary. It is based on the condition that each individual study reports the same finding for the same research question. The potential advantage of meta-analysis is the increase in the sample size and the validity of statistical inference. It would be difficult to utilize meta-analysis methodologies if individual studies only provide partial findings.
In a practical example, meta-analysis could be used to build a comprehensive and multivariate prediction model for the risk of chronic diseases such as coronary heart disease (CHD). A wide range of CHD risk factors have been reported in the literature, but a comprehensive multivariate CHD prediction model has yet to be found. The Framingham CHD model is widely considered the most comprehensive model, although many well-known CHD risk factors, such as body mass index (BMI), family history of CHD, and c-reactive protein, are not included in the model [1–3].
We propose a new process to solve several of the problems presented above. This novel multivariate meta-analysis modeling method is called synthesis analysis. Using multiple study results reported in the scientific and medical literature, the objective of our synthesis analysis is to estimate the multivariate relations between multiple predictors (Xs) and an outcome variable (Y) from the univariate relation of each X with Y and the two-way correlations between each pair of Xs. All the inputs may come from various studies in the literature, while a cross-sectional population survey may provide correlations of all Xs. We reported the first method of synthesis analysis (the Samsa-Hu-Root or SHR method) in which the partial regression coefficients were calculated using the following matrix equation:
where B is the vector of partial (excluding the intercept, B0) regression coefficients, Bu is the vector of univariate regression coefficients, R is the vector of Pearson correlation coefficients among all independent variables, S is the vector of standard deviations of the independent variables, # stands for element-wise multiplication, and/stands for element-wise division. The intercept, B0, can be calculated using the resulting multivariate formula, the mean of the predictors and outcome, and the newly calculated partial regression coefficient for each predictor.
In the present study, we propose an improvement to the existing synthesis analysis. Compared with the previous method, this method has at least two advantages: (1) it includes a method to compute the variances for predicted outcomes and estimated regression coefficients and (2) the estimates of predicted outcomes and regression coefficients can be more robust when the independent variables are not normally distributed.
Our paper is organized as follows. In Section 2, we describe our new method. In Section 3, we report a simulation study on finite-sample performance of the proposed method in comparison with the existing synthesis method. In Section 4, we illustrate the use of the proposed method in a real-life example from the 1999–2000 National Health and Nutritional Examination Survey. Finally, in Section 5, we conclude our paper with a discussion on some extensions.
Suppose that we know the individual relationships between an outcome Y and each of p risk factors, X1, X2, …, and Xp, which are given as follows:
where i = 1,2, …, p. In addition, we assume that we know the mean relationships between any two pairs among the p risk factors:
where i, j = 1,2, …, p and i ≠ j.
The goal of synthesis analysis is to determine the multivariate linear regression model between Y and the p risk factors:
Taking the conditional expectation of the both sides of (3) given Xi, we obtain the following equation:
for all x, where i = 1, …, p. Therefore, we obtain the following two sets of equations:
for i = 2, …, p; and
for i = 2, …, p.
Let M be a p × p matrix with diagonal elements 1, and element when i ≠ j; let β = (βk, k = 1, …, p), and . From (6), we obtain the following p equations for the p unknown slope parameters, β1, …, βp:
Using Cramer’s rule, we can easily solve the above p simultaneous linear equations. Let us define the following determinants:
Cramer’s rule gives us the following unique solution to the system of equations (8):
where k = 1, …, p.
After obtaining estimates of the vector of slope parameters, β, we can derive an estimate for the intercept parameter, β0, using any one of the p equations given in (6). Hence, we have the following p equations for the unknown intercept parameter β0:
Although there are p equations for the parameter β0, we show that the solution of β0 is unique in Appendix A. We give a detailed description of our solution for the two-covariate case in Appendix B, and in Appendix C, we give an explicit formula for our synthesized parameters in cases with three and four covariates.
The variance can be estimated using the delta method by assuming that the univariate parameter estimates and from individual univariate linear regression models, given by (1), are independent of each other . Let and .
By the well-known result from simple linear regression, we know:
where α0 and γ0 are the true expected values of α and γ,
where is the covariance between and , and
is the covariance matrix of the estimated parameters .
The synthesized parameter estimates β =(β0, βl, …, βp)T are functions of α’s and γ’s, which can be expressed mathematically as:
If the function g is differentiable, then the delta method gives the asymptotic variance of β as follows:
where g(α, γ) is the vector of derivatives of function g with respect to β=( β0, β1, …, βp). We give an explicit formula for g(α, γ) when p = 2 in Appendix B. Many programs, such as Mathematica, can perform derivatives symbolically, thereby making the variance calculation much easier, since the derivation of the exact form of the g is not required before the calculation.
Once the estimates of parameters and their variances have been derived, we can calculate the covariance matrix of predicted values as follows:
where XT is the transpose of the X matrix, and Σβ is the covariance matrix of β, given by (9).
The mean-squared error (MSE) of the predicted value is given by
where Ŷi and Yi are the predicted and observed value of subject i, respectively. The correlation coefficient between Ŷi and Yi, ρ, can be calculated by
where Cov(Ŷi, Yi) is the covariance between predicted and observed values.
We conducted a simulation study to assess the performance of the proposed method in comparison with our previous method , denoted by SHR. We simulated data with two, three, and four predictor variables. For simplicity of presentation, we only reported the results for the two-predictors here, because the results for three-predictor and four-predictor cases are similar to those in the two-predictor case.
In each of these cases, we simulated independent variables from (1) a multivariate normal distribution, (2) a multivariate log-normal distribution, (3) a multivariate exponential distribution, and (4) a multivariate gamma distribution. We chose the variances of all the independent variables to be 1 and correlations for pairs of the independent variables to be 0.5. After simulating the independent variables X, we generated the dependent variable Y by adding random normal errors to the mean model:
where ε is a random error following the standard normal distribution.
We set the true regression parameters as follows: (β0, βl, β2) = (−5, 5, 3) for the two-variable setting, (β0, βl, β2, β3) = (−5, 1, 3, 5) for the three-variable setting, and (β0, βl, β2, β3,β4) = (−5, 5, 4, 3, 1) for the four-variable setting. We divided each data set into (p = 2,3,4) subsets with equal sample sizes. Here, denoted the total number of combinations of choosing 2 items from (p + 1) items. In simulated data, each subset contained only one pair of variables chosen from Y, X1, …, Xp. The sample size (the total number of observations) used in simulation was 300 and 3000 (with equal size for each subset). For each of the above settings, we simulated a total number of 1000 data sets. As the results for the data from the skewed log-normal distribution were similar to those from the other skewed distributions, we only reported the results for the normal and log-normal distributions. We reported the mean bias and MSE for estimated parameters in Tables I and andIIII.
In order to evaluate the accuracy of predicted values using the new model, we simulated two data sets with equal sample sizes. One was used as the training set for model derivation, while the other was used as the validation data set. To evaluate prediction performance, we reported mean bias, MSE, and the mean of standard error estimates (SEEs) for predicted values in Tables III and andIV.IV. The SEEs were derived using the method developed in Sections 2.2 and 2.3. The correlations between predicted and observed values were also reported in the two tables.
Simulation results for the regression parameters showed that the mean bias and MSE of the estimated regression parameters using our new method were, in general, better than those using the SHR method, across all of the distributions and sample sizes considered here. The results also indicated that when the distributions of independent variables X were heavily skewed (log-normal distribution), the bias and MSE of the estimated regression parameters using both methods were large, especially when sample sizes were small. Nonetheless, the results from our new method were much better than those from the SHR method under this situation.
The results for predicted values indicated that both the new method and the SHR method had similar correlations between observed and predicted values across all sample sizes and distributions. However, mean bias and MSE for predicted values derived from our new method were much smaller than those from the SHR method.
In this section, we analyzed a real-world example and compared the results using our new synthesis method and the SHR method. The data came from the 1999–2000 National Health and Nutritional Examination Survey . There were five variables in this data set, including one outcome Y, systolic blood pressure, and four predictors, X1, X2, X3, and X4, which represented age, body mass index (BMI), serum total cholesterol level, and the natural log of serum triglycerides, respectively. First, we fitted a multivariate regression model to this data set, which would serve as the gold standard for this analysis. Next, we randomly divided the data set into the five mutually exclusive subsets with approximately equal sample sizes. The first four subsets included the outcome Y and each of the four covariates, X1, X2, X3, and X4, respectively. The last subset contained all four covariates, which was used to derive pairwise correlations among the covariates. We applied the two synthesis methods to these five subsets to obtain estimated parameters in the multivariate regression model and reported the results in Table V. For comparison purposes, we also included the estimated parameters in the multivariate regression models obtained by the gold standard model in Table V.
The estimated parameters and their standard errors (SEs) from the gold standard and from both our new method and SHR method were listed in Table V (SE was not available by the SHR method). From these results, we observed that the new method produced the coefficient estimates that were comparable to those derived using the gold standard. However, the estimates for Intercept and LOGTRIG from the SHR method were varied somewhat from those derived using the gold standard method. As an illustration, the predicted value for a 65-year-old subject with the BMI of 19, the serum total cholesterol level of 190, and the serum triglycerides of 160 would be 134, 135, and 136, using the gold standard method, the new method, and the SHR method, respectively.
In this paper, we provided several enhancements to the existing SHR synthesis analysis methodology. These improvements allow for more robust estimates of the regression parameters and predicted values when covariates are not normally distributed. Additionally, the new method allows for estimation of the variance of the resulting parameters and predicted outcomes.
Both the previously reported SHR method and our improved method allow for the building of multivariate regression models using univariate regression coefficients and two-way correlation coefficient data that are derived from different data sources. The underlying assumption is that each individual study is representative of the target population. However, the validity of the previously reported SHR synthesis analysis methodology relies on the normality assumption of the data. Although synthesis analysis is related to both meta-analysis and analysis of missing data, it is also different from these two traditional analyses in two important ways. First, while the goal of traditional meta-analysis is to combine the multivariate regression models with the same covariates from different studies, the goal of synthesis analysis is to create a multivariate linear regression model from univariate linear regression models on different covariates. Although the statistical problem that synthesis analysis address may be considered as one particular type of missing-data problem, unlike a traditional analysis, synthesis analysis does not require individual level data; rather, synthesis analysis only requires coefficient estimates of univariate linear regression models between the outcome and a covariate and between any two covariates.
Although the proposed method was developed to synthesize different univariate linear regression models with different covariates into multivariate linear regression models, it can be easily extended to the setting in which several studies are available for some (or all) of the univariate regression models. In this case, there would be variation among the parameter estimates. For example, if there are five studies available for the linear model, E(Y | X1), and six studies for the linear model, E(X1 | X2), then we would have the five sets of estimates for the intercept and slope of the linear model of Y on X, denoted by and , for j = 1, …, 5, and the six sets of estimates for the intercept and slope of the linear model of X1 on X2, denoted by and , for k = l, …, 6.
In this case, we propose to first combine the results on the same univariate regression model from different studies into the one univariate regression model using the weighted mean of and , with the weight being the inverse sample size; that is,
where Nj is the sample size for the jth univariate model between Y and X1, and . Then, we apply the proposed synthesis method in Section 2 to obtain the multivariate regression model.
We performed a simulation study to assess the performance of the modified method in the two independent variables case, with one independent variables following a normal distribution and another following a log-normal distribution. We also compared this modified method with other combining methods, including mean, median, minimum, and maximum of multiple estimates for a same regression parameter. From these simulation results, we concluded that parameter estimates using the weighted mean had the smallest bias and MSE, and were very close to the bias and MSE using the gold standard. In addition, the predicted value using the weighted mean had the smallest bias, MSE, and SEE. We give a detailed description on our simulation study and results in Appendix D. The computer software for implementing the proposed method is available at http://faculty.washington.edu/azhou.
We would like to thank Vicki Ding and Hua Chen for their help in preparing this manuscript. Xiao-Hua Zhou, PhD, is presently a Core Investigator and Biostatistics Unit Director at the Northwest HSR&D Center of Excellence, Department of Veterans Affairs Medical Center, Seattle, WA. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs. This study has been partially supported by NSFC 30728019.
Contract/grant sponsor: NSFC; contract/grant number: 30728019
Here we show that there is a unique solution for the intercept term β0 with the p equations (5), meaning that we need to show that the following p solutions are equivalent:
Without losing generality, we only show that the solutions of the first two equations are equal, that is, . The proof for other solutions is similar.
In order to show
Because , we can get the following result:
Hence, we can replace ( ) with in (A2) and obtain the following result:
Because β1, …, and βp are the solutions of Mβ = γ1, we can obtain the following result:
Hence, the right side of (A4) becomes , which equals to E(Y) because .
Similarly, we can proof the right side of (A1) plus E(X1)β1 + E(X2) β2 + ··· + E(Xp)βp is also equal to E(Y). This completes the proof.
When p = 2, we can also have an explicit formula for the derivative of β = g(α, γ) with respect to agr; and γ, g(α, γ), for the two independent variables case. Here, g(α, γ) is used to calculate the variance of β and predicted values.
When there are three predictors in the model, D and Di, (i = 1, 2, 3) are given as follows:
If there are four predictors in the regression model, the D and Di, (i = 1, 2, 3, 4) are as follows:
We performed a simulation study to assess the performance of the modified method, as described in the discussion section, for the two independent-variable case when the vector of two covariates follows a bivariate normal distribution or bivariate log-normal distribution. We also compared this modified method with the other combining methods, including mean, median, minimum, and maximum of multiple estimates for a same regression parameter. For each of the three univariate linear models, E(Y | X1), E(Y | X2), and E(X1 | X2), there were the estimates from five different studies. We selected the sample size for each of the five studies for each univariate model to be equal (1000 and 100) or unequal (100, 200, 500, 1200, 3000) or (10, 20, 50, 120, 300). We assessed the performance of the modified synthesis method using the weighted mean, mean, median, minimum, and maximum of combing results from the five studies.
Since our results on the simulated data from the bivariate normal distribution are similar to those on the simulated data from the bivariate log-normal distribution, we only report the results on the bivariate normal distribution case. Tables DI–DIV show the bias and MSE for each of the regression parameters β0, β1, β2 as well as the mean bias, MSE, correlation, and SEE (mean of SE estimates) for the predicted values.