For our simulation, we studied a candidate gene region with 11 SNPs. Their haplotype frequencies are given in Figure 3 of Yeager et al. (2007)
. There were a total of 25 haplotypes in the study population. To assign genotypes at the 11 SNPs for a subject in the study population, we independently selected pairs of haplotypes according to the haplotype frequencies. To evaluate the power of the proposed multivariate tree model and of the univariate-MinP approach, we designated the 5th
SNP in the candidate region as the single functional locus (one-locus risk model) that influences the quantitative traits under study. We assumed that this functional SNP was not typed and that genotypes at the other 10 SNPs were available for the association study. The power was calculated as the probability that the joint effect on the outcome from the genotypes at the 10 measured SNPs is significant under a given significance level (according to the adjusted p-value from the multivariate tree model or the univariate-MinP method). In addition to the genotypes, we considered a non-genetic binary covariate X
that had to be adjusted for in the analysis. We let Pr(X
= 1) = 0.4 and Pr(X
= 0) = 0.6 , and we further assumed that X
and the genotypes were independent in the study population. Let g5
be the coded value for the unobserved genotype at the functional locus, with values of 0, 1, and 2 representing the copy number of the minor allele “1”. Conditioned on X
, we generated a multivariate quantitative trait Y
)′ from the multivariate normal distribution MVN
) , where μ
, with β
)′ and γ
)′ being the given coefficients and V
being the covariance matrix. In our simulation, we considered a multivariate trait consisting of 2 or 3 phenotypes (i.e., M
= 2, or 3). For simplicity, we assumed a covariate effect βm
= 0.2, m
= 1, …, M
, for all conducted simulations. Besides the multivariate quantitative traits, we considered multivariate binary traits by dichotomizing the Y
, …, yM
)′ generated from MVN
) using the expected median value of ym
= 1, …, M
, as the cut-point.
To evaluate the type I error rate, we considered a bivariate trait (quantitative or its dichotomized version) unrelated to the genotype, i.e., γm
= 0, m
= 1,2. We assumed the covariance matrix
, with ρ
= 0, 0.1, or 0.2. For each value of ρ
, we ρ
1 generated 2000 datasets, each consisting of 800 subjects with their bivariate trait, genotypes at 10 SNPs and covariate X
simulated according to the procedure described above.
To evaluate power, we simulated a bivariate trait (quantitative or its dichotomized version) by assuming γm
= 0.18 for both phenotypes, and the covariance matrix
, with ρ
being chosen among (-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3). We also simulated a trivariate trait with the associated variance-covariance matrix
under the following three scenarios, Scenario I: γ1
= 0.18, and γ2
= 0; Scenario II: γ1
= 0.18 , and γ3
= 0 ; and Scenario III: γ1
= 0.18. This covariance matrix was similar to the one observed in the application described later. Under each considered case, we simulated 500 datasets, each consisting of 800 subjects for the power evaluation.
We analyzed each simulated dataset using model (2)
allowing coefficients to vary for different outcomes. In the tree-building process, we considered splitting rules based on genotypes (coded as 0, 1, and 2, corresponding to the copy number of a chosen allelic type) on a single SNP, and we treated genotypes as continuous variables, i.e., the legitimate splitting rules are: 0 vs. 1 and 2; and 0 and 1 vs. 2, but not 1 vs. 0 and 2. We used different assumptions about the correlation matrices in the GEE method (e.g., independent and unstructured) and obtained quite similar results based on simulated datasets (see Web ). In the following, we present results based on analyses using “independence” as the working correlation matrix in the GEE method.
Table 1 Estimated type I errors of the multivariate tree model under the analysis of a bivariate phenotype with the covariance matrix
. For either the continuous outcome or its dichotomized version, the type I error rate was estimated based on 2,000 simulated (more ...)
Simulation results for type I error evaluation are given in . It appears that the proposed method has a type I error rate close to the nominal levels (α = 0.01, 0.05) under all covariance structures considered for both continuous and binary traits. We also evaluated the type I error rate of the multivariate tree model when the tree size was pre-specified. Results are given in Web . We can see from the table that the p-value estimated from the permutation procedure for a tree with a given size has the desired property.
Table 2 Power comparison at the significance level of 0.05 for the analyses of a bivariate phenotype with the covariance matrix
simulated under the one locus risk model. For either the continuous outcome or its dichotomized version, the power was estimated (more ...)
Power calculations for the simulated bivariate outcomes are given in . Under the given covariance matrix, we compared the power of the multivariate analysis (applying the proposed method to study two phenotypes jointly) with that of the univariate-MinP procedure (multiple-comparison adjusted univariate analysis). Results in demonstrate the advantage of using the multivariate approach for both the quantitative and binary traits. Interestingly, the power improvement over the univariate-MinP analysis increases as the correlation between the two outcomes changes from positive to negative. The explanation is that the negative correlation between the two outcomes helps to reduce the variance in the coefficients’ estimation when the underlying effects (γ1 and γ2) for the two correlated phenotypes point in same direction (both either positive or negative).
Power calculations for the simulation of a trivariate quantitative outcome are given in . For each simulated dataset, we analyzed various subsets of the multivariate outcome using the joint analysis as well as the univariate-MinP analysis. From we can observe the gain in power when individual phenotypes related to the same gene were studied together over the analysis of individual phenotypes. The power to detect the gene reaches its highest level when all relevant phenotypes are studied jointly. But when the multivariate trait under study includes one or more phenotypes not related to the gene under study, the power is diminished to some extent.
Power comparison for the analyses of three phenotypes Y1 , Y2 , and Y3 at the significance level of 0.05. The power was estimated based on 500 simulated datasets, with each consisting of 800 subjects.
We further evaluated the performance of the proposed method by simulating data under a 2-locus risk model, and we found that our method performed very reasonably (See Web ). The simulation design is described in the Web Appendix C.
We evaluated the convergence properties of the iterative algorithm through extensive simulation studies. We found that the algorithm usually converged within 50 iterations in the simulation setup described above. When the sample size was 500, the algorithm converged in 995~1000 out of 1000 replicated datasets under the various considered scenarios. When the sample size increased to 1000, the algorithm converged in 998-1000 out of 1000 replicated datasets. The working matrix did not appear to have an impact on the convergence properties of the algorithm; neither did the type of outcome (continuous or binary). When the algorithm did not converge, it always iterated between two tree models. In these rare situations, we chose the tree model with the larger Wald test statistic.