J Evol Biol. Author manuscript; available in PMC 2012 August 1.

Published online 2011 May 23. doi: 10.1111/j.1420-9101.2011.02297.x

PMCID: PMC3135688

NIHMSID: NIHMS291016

Dmitri V. Zaykin^{1}

Correspondence: Dmitri Zaykin, National Institute of Environmental Health Sciences, 111 T.W. Alexander Drive, South Bldg (101); Mail Drop: A3-03, RTP, NC 27709, zaykind@niehs.nih.gov, phone: (919) 541-0096, fax: (919) 541-4311


The inverse normal and Fisher’s methods are two common approaches for combining *P*-values. Whitlock demonstrated that a weighted version of the inverse normal method, the “weighted *Z*-test”, is superior to Fisher’s method for combining *P*-values for one-sided *T*-tests. The problem with Fisher’s method is that it does not take advantage of weighting and loses power relative to the weighted *Z*-test when studies differ in size. This issue was recently revisited by Chen, who observed that Lancaster’s variation of Fisher’s method had higher power than the weighted *Z*-test. Nevertheless, the weighted *Z*-test has power comparable to Lancaster’s method when its weights are set to the square roots of the sample sizes. Power can be further improved when additional information is available. Although no single approach is best in every situation, the weighted *Z*-test enjoys certain properties that make it an appealing choice as a combination method for meta-analysis.

Evolutionary biologists have long used meta-analytic approaches to combine information from multiple studies. When raw data cannot be pooled across studies, meta-analysis based on *P*-values presents a convenient approach that can be nearly as powerful as one based on combining the raw data. Many popular *P*-value combination methods take the same general form: the *P*-value for the *i*-th study, *p _{i}*, is transformed by some function, and the transformed values are summed, possibly with weights.

Carefully chosen weights can, in general, improve the power of combination methods. A motivation for weighting is that different studies may be differently powered, and the weights should reflect this. Consider the combined *P*-value of the weighted *Z*-test:

$${p}_{Z}=1-\mathrm{\Phi}\left(\frac{{\sum}_{i=1}^{k}{w}_{i}{Z}_{i}}{\sqrt{{\sum}_{i=1}^{k}{w}_{i}^{2}}}\right)$$

(1)

where *Z _{i}* = Φ^{−1}(1 − *p _{i}*), Φ^{−1} is the inverse of the standard normal cumulative distribution function Φ, and *w _{i}* are the study weights.
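
As a concrete illustration, Equation (1) can be computed in a few lines (a minimal sketch using Python and SciPy; the input *P*-values and sample sizes below are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.stats import norm

def weighted_z(p, w):
    """Combine one-sided P-values with the weighted Z-test, Equation (1)."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    z = norm.isf(p)  # Z_i = Phi^{-1}(1 - p_i)
    # 1 - Phi( sum(w_i Z_i) / sqrt(sum(w_i^2)) )
    return norm.sf(w @ z / np.sqrt(w @ w))

# Example: three studies weighted by the square roots of their sample sizes
p = [0.04, 0.10, 0.30]
n = np.array([50, 100, 200])
print(weighted_z(p, np.sqrt(n)))  # prints the combined one-sided P-value
```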

When different samples are taken from similar populations, a model that assumes a common effect size and direction among samples is appropriate. The ideal approach in this case is to pool raw data from all samples and to conduct a single statistical test. Whitlock considered such a test with its *P*-value and evaluated how well a combined *P*-value approximates this “true” *P*-value (Whitlock, 2005). He evaluated Fisher’s method for combining *P*-values (Fisher, 1932) as well as the unweighted (Stouffer’s) and weighted *Z*-tests, using one-sided *P*-values. Whitlock found via simulation experiments that the weighted combination *Z*-test outperformed both Fisher’s and Stouffer’s methods. Nevertheless, weighted versions of Fisher’s method exist, and it had remained unclear whether a weighted version of Fisher’s method might be as powerful as the weighted *Z*-test. This issue was recently taken up by Chen, who found that Lancaster’s generalization of Fisher’s test was more powerful than the weighted *Z*-test (Chen, 2011). In Chen’s application, *P*-values were transformed to chi-square variables by an inverse chi-square transformation with the degrees of freedom equal to the sample size of the study; i.e., Lancaster’s statistic is
$T=\sum {\left[{\chi}_{({n}_{i})}^{2}\right]}^{-1}(1-{p}_{i})$ with the distribution
$T\sim {\chi}_{(\sum {n}_{i})}^{2}$.
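
Lancaster’s statistic can be sketched the same way (a Python/SciPy illustration, assuming one-sided *P*-values and using each study’s sample size as its degrees of freedom, as in Chen’s application):

```python
import numpy as np
from scipy.stats import chi2

def lancaster(p, n):
    """Lancaster's method: transform p_i to a chi-square variable with
    n_i degrees of freedom; the sum is chi-square with sum(n_i) df under H0."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n)
    t = chi2.isf(p, df=n).sum()    # [chi2_(n_i)]^{-1}(1 - p_i), summed
    return chi2.sf(t, df=n.sum())  # combined P-value
```

For a single study the transformation is its own inverse, so `lancaster([p], [n])` returns `p` unchanged, which is a convenient sanity check.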

Both Whitlock and Chen used non-optimal weights for the weighted *Z*-test, setting them to the sample sizes of the studies. Whitlock’s original conclusions are valid, but the weights need to be adjusted according to suggestions by Lipták and Won et al. In Whitlock’s setup, samples that corresponded to different studies were drawn from the same population, so the *T*-test based on pooled raw data can be viewed as an “ideal” test. In this case, optimal weights for the weighted *Z* method are given by the square roots of the sample sizes,
$\sqrt{{n}_{i}}$. These weights are optimal in the sense that the combined *P*-value approximates the *P*-value of the test based on raw data. This can be seen by writing out a *Z* statistic based on pooled raw data in terms of statistics for the individual studies. The pooled data statistic is
${Z}_{\text{total}}=\sqrt{{n}_{T}}\overline{T}/{\widehat{S}}_{T}$, where $\overline{T}$ is the sample average for the total sample of size *n _{T}* and ${\widehat{S}}_{T}$ is the corresponding sample standard deviation. For two studies with sample means $\overline{X}$, $\overline{Y}$ and sizes *n _{X}*, *n _{Y}* (so that ${n}_{T}\overline{T}={n}_{X}\overline{X}+{n}_{Y}\overline{Y}$), the pooled statistic can be decomposed as

$${Z}_{\text{total}}=\frac{{n}_{X}\overline{X}}{\sqrt{{n}_{T}}{\widehat{S}}_{T}}+\frac{{n}_{Y}\overline{Y}}{\sqrt{{n}_{T}}{\widehat{S}}_{T}}$$

while the weighted statistic that combines information from the two samples is

$${Z}_{w}=\frac{{w}_{X}{\scriptstyle \frac{\sqrt{{n}_{X}}\overline{X}}{{\widehat{S}}_{X}}}+{w}_{Y}{\scriptstyle \frac{\sqrt{{n}_{Y}}\overline{Y}}{{\widehat{S}}_{Y}}}}{\sqrt{{w}_{X}^{2}+{w}_{Y}^{2}}}$$

The pieces
${\scriptstyle \frac{\sqrt{{n}_{X}}\overline{X}}{{\widehat{S}}_{X}}}$ and
${\scriptstyle \frac{\sqrt{{n}_{Y}}\overline{Y}}{{\widehat{S}}_{Y}}}$ can be recovered from the *P*-values for the two samples by the inverse normal transformation. This statistic is the weighted *Z*-test for combining *P*-values. We can see that *Z _{w}* approximates *Z*_{total} when the weights are set to ${w}_{i}=\sqrt{{n}_{i}}$: the combined statistic then becomes $({n}_{X}\overline{X}/{\widehat{S}}_{X}+{n}_{Y}\overline{Y}/{\widehat{S}}_{Y})/\sqrt{{n}_{T}}$, which coincides with *Z*_{total} when the individual estimates ${\widehat{S}}_{X}$ and ${\widehat{S}}_{Y}$ are close to ${\widehat{S}}_{T}$.

Chen chose Lancaster’s method over an extension of Fisher’s test in which weighted inverse chi-square-transformed *P*-values are added, for the reason that “the distribution of the sum of weighted *χ*^{2} is usually unknown”. Several algorithms for obtaining this distribution have been published, however, and are freely available. Duchesne and Lafaye De Micheaux recently described an R package that implements several approximations to that distribution as well as “exact” algorithms with guaranteed, user-controlled precision (Duchesne & Lafaye De Micheaux, 2010). The weighted Fisher’s test is a direct *χ*^{2}-based analogue of the weighted *Z*-test, so I included this method in the comparisons. Specifically, the weighted Fisher’s test (the weighted *χ*^{2} test) is based on the distribution of the following statistic:

$${F}_{w}=\sum _{i=1}^{k}{w}_{i}{\left[{\chi}_{(2)}^{2}\right]}^{-1}(1-{p}_{i})$$

(2)

where ${\left[{\chi}_{(2)}^{2}\right]}^{-1}$ is the inverse cumulative chi-square distribution function with two degrees of freedom.
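
Under the null hypothesis, *F _{w}* is a positive linear combination of chi-square variables; exact algorithms for this distribution exist (Farebrother, 1984; Duchesne & Lafaye De Micheaux, 2010). The Python/SciPy sketch below instead uses a simple two-moment chi-square approximation in the style of Box (1954), matching the mean and variance of *F _{w}*; this approximation is my illustrative choice, not the exact method used in the paper. With equal weights it reduces exactly to Fisher’s method.

```python
import numpy as np
from scipy.stats import chi2

def weighted_fisher(p, w):
    """Weighted Fisher statistic, Equation (2), with a two-moment (Box-type)
    chi-square approximation to its null distribution: F_w ~ g * chi2(h),
    where g and h match E[F_w] = 2*sum(w) and Var[F_w] = 4*sum(w^2)."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    fw = np.sum(w * chi2.isf(p, df=2))   # Equation (2)
    g = np.sum(w**2) / np.sum(w)         # scale factor
    h = 2 * np.sum(w)**2 / np.sum(w**2)  # effective degrees of freedom
    return chi2.sf(fw / g, df=h)
```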

For the simulation experiments I followed the setup of Chen and Whitlock. I assumed a *T*-test of the null hypothesis *H*_{0}: *μ* ≤ 0 against the one-sided alternative *μ* > 0, with values of *μ* from 0 to 0.1 in increments of 0.01. For eight studies with sample sizes *n _{i}* of 10, 20, 40, 80, 160, 320, 640, and 1280, random samples were drawn from a normal distribution with mean *μ*.

Tables 1 and 2 present power values for the studied tests. Table 1, which follows the setup of Whitlock and Chen, shows that the weighted *Z* test with weights
$\sqrt{{n}_{i}}$ or $1/\widehat{{\text{SE}}_{i}}$, Lancaster’s method, and the test based on pooled data all have nearly identical power. The weighted Fisher’s test has slightly lower power. Table 2 shows power values for heterogeneity scenarios as well as type-I error rates for the case *μ* = 0 but with a random, study-specific variance. The total *T* test is no longer the most powerful in this case, due to heterogeneity of effects. Weighting by either
$\sqrt{{n}_{i}}$ or
$1/\widehat{{\text{SE}}_{i}}$ delivers the same improvement in power when only the means are heterogeneous between studies. When there is heterogeneity of the variances, weighting by
$1/\widehat{{\text{SE}}_{i}}$ yields a power advantage over weighting by
$\sqrt{{n}_{i}}$. Power is highest when standardized effects (
$\mu /\widehat{{\text{SE}}_{i}}$) are used as weights. Correlations between the true and the combined *P*-values were at least 99% for all values of *μ* for Lancaster’s and the weighted *Z* methods. The corresponding correlation for the weighted Fisher’s method was lower, ranging from about 91% to 94% depending on the value of *μ*. Tukey’s plots in Figure 1 show good correspondence of *P*-values for the pooled data test with *P*-values for Lancaster’s and the
$\sqrt{{n}_{i}}$–weighted *Z* methods. Lancaster’s method forms a more “snowy” cloud, and the weighted *Z* method’s *P*-values are somewhat closer to the true values.

Tukey’s plots of *P*-values for Lancaster’s and the
$\sqrt{{n}_{i}}$–weighted *Z*-test vs. the total *T* test. Top row: *μ* = 0. Bottom row: *μ* = 0.05.

Meta-analysis of *P*-values generally benefits from weighting. When samples are obtained from the same or similar populations, as in the model studied by Whitlock and Chen, the optimal weights for the *Z*-test are given by
$\sqrt{{n}_{i}}$. In this case, the weighted *Z*-test, Lancaster’s test and the test based on pooled data provide very similar power. This is expected, because Lancaster’s method approaches the weighted *Z* method asymptotically, as min(*n _{i}*) increases. When there is heterogeneity of variances, but the true mean is the same across studies, weighting by
$1/\widehat{{\text{SE}}_{i}}$ is optimal, but the gain in power is not great, compared to weighting by
$\sqrt{{n}_{i}}$ (0.784 vs. 0.743).

In this study, one-sided *P*-values were assumed. Such *P*-values are appropriate for meta-analytic combination of *P*-values from several studies. Two-sided *P*-values are generally inappropriate, because they are oblivious to the effect direction. Two-sided *P*-values from two studies in which the effect direction is flipped can nevertheless both be small, resulting in an inappropriately small combined *P*-value. By contrast, the combined result of the corresponding one-sided *P*-values will properly reflect the cancellation of the pooled effect that would have been observed if raw data from the two studies were combined.
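
This cancellation is easy to demonstrate numerically (an illustrative Python/SciPy example with made-up *Z*-statistics, not values from the paper):

```python
import numpy as np
from scipy.stats import norm

# Two equally sized studies with effects of equal magnitude but opposite sign
z1, z2 = 2.5, -2.5

# Both two-sided P-values are small, and would misleadingly combine to a
# small P-value even though the effects contradict each other:
p_two_sided = [2 * norm.sf(abs(z1)), 2 * norm.sf(abs(z2))]  # both ~0.0124

# The one-sided P-values point in opposite directions and cancel:
p_one_sided = [norm.sf(z1), norm.sf(z2)]  # ~0.0062 and ~0.9938
z = norm.isf(np.array(p_one_sided))
p_comb = norm.sf(z.sum() / np.sqrt(2.0))  # equal weights
print(p_comb)  # 0.5: no combined evidence, mirroring the pooled-data result
```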

Although the mechanics of the meta-analytic process involve manipulation of one-sided *P*-values, the final result often needs to be a two-sided *P*-value. For example, when allele frequencies are compared between two groups of individuals classified based on the presence or absence of a trait, the null hypothesis is usually that the frequency is the same, and the alternative hypothesis does not specify a particular effect direction. The weighted *Z*-test provides an important advantage in this situation, due to the symmetry of the normal transformation. There are two possible one-sided combined *P*-values, one for each assumed effect direction, but with the weighted *Z* method the combined *P*-value for the first assumed direction is the same distance from ^{1}/_{2} as the combined *P*-value for the second. Therefore, one can arbitrarily assume either of the two directions when computing one-sided *P*-values and obtain a combined one-sided *P*-value, *p*_{one-sided}. The two-sided combined *P*-value is the same regardless of the assumed direction:

$${p}_{\text{two}-\text{sided}}=\{\begin{array}{ll}2\phantom{\rule{0.16667em}{0ex}}{p}_{\text{one}-\text{sided}}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{p}_{\text{one}-\text{sided}}<1/2\hfill \\ 2\phantom{\rule{0.16667em}{0ex}}(1-{p}_{\text{one}-\text{sided}})\hfill & \text{otherwise}\hfill \end{array}$$

(3)

What if the available individual *P*-values are all two-sided? Often, studies report *P*-values that correspond to statistics such as |*T*| and |*Z*|, or their squared values, i.e. the one-degree-of-freedom chi-square. These individual *P*-values can be converted to one-sided before combining, as follows:

$${p}_{\text{one}-\text{sided}}=\{\begin{array}{ll}{p}_{\text{two}-\text{sided}}/2;\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}\text{effect}\phantom{\rule{0.16667em}{0ex}}\text{direction}>0\hfill \\ 1-{p}_{\text{two}-\text{sided}}/2;\hfill & \text{otherwise}\hfill \end{array}$$

Once again, the assumed effect direction can be chosen arbitrarily. For example, in testing for association of an allele with a trait at a biallelic locus *A*/*a*, we can arbitrarily choose one of the alleles, e.g. allele *A*. Then the “effect direction” for *i*-th study is positive if there is positive correlation of that allele with the presence of the trait in that study. Once these one-sided *P*-values are combined, the result can be converted back to two-sided by Equation (3).
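
The two conversions above can be sketched as a pair of small helper functions (Python; the direction indicator is the arbitrary per-study choice described in the text):

```python
def to_one_sided(p_two_sided, positive_direction):
    """Convert a study's two-sided P-value to one-sided, given the
    (arbitrarily chosen) reference effect direction for that study."""
    return p_two_sided / 2 if positive_direction else 1 - p_two_sided / 2

def to_two_sided(p_one_sided):
    """Convert a combined one-sided P-value back to two-sided, Equation (3)."""
    return 2 * p_one_sided if p_one_sided < 0.5 else 2 * (1 - p_one_sided)
```

Note that applying `to_two_sided` to either of the two mirror-image one-sided results (e.g. 0.02 and 0.98) gives the same two-sided *P*-value, consistent with the symmetry argument above.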

Another advantage of the weighted *Z* test is that it can be easily extended to the case of correlated statistics between studies. For the test to be valid under independence, we need the assumption that the set of {*Z _{i}*} jointly follows a multivariate normal distribution under the null hypothesis. If cor(*Z _{i}*, *Z _{j}*) is known for each pair of studies, the denominator of Equation (1) can be replaced by $\sqrt{{\sum}_{i}{w}_{i}^{2}+2{\sum}_{i<j}{w}_{i}{w}_{j}\phantom{\rule{0.16667em}{0ex}}\text{cor}({Z}_{i},{Z}_{j})}$, which is the standard deviation of ${\sum}_{i}{w}_{i}{Z}_{i}$ under the null hypothesis.
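
A sketch of this extension (Python/SciPy; the correlation matrix `R` of the *Z*-statistics is assumed known and must be supplied by the user):

```python
import numpy as np
from scipy.stats import norm

def weighted_z_correlated(p, w, R):
    """Weighted Z-test for correlated studies: under H0 the variance of
    sum(w_i Z_i) is w' R w, where R[i, j] = cor(Z_i, Z_j)."""
    z = norm.isf(np.asarray(p, dtype=float))
    w = np.asarray(w, dtype=float)
    R = np.asarray(R, dtype=float)
    return norm.sf(w @ z / np.sqrt(w @ R @ w))
```

With `R` equal to the identity matrix this reduces to Equation (1); positive correlations inflate the null variance and appropriately weaken the combined evidence.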

Although there is no single method for combining *P*-values that is most powerful in all situations, the meta-analytic setup considered by Whitlock and extended here to include study heterogeneity is quite general, because many forms of one-sided statistics approach a normal distribution asymptotically. Therefore, the
$\sqrt{{n}_{i}}$– or
$1/\widehat{{\text{SE}}_{i}}$–weighted *Z*-test for combining one-sided *P*-values can be recommended in most situations.

In this study, the weighted Fisher’s method showed slightly lower power than the other methods considered. If absolute values or squares of the *T*-statistics for each study were assumed instead, as in the calculation of two-sided tests, the weighted Fisher’s method would have yielded higher power than either Lancaster’s or the weighted *Z* methods. As already noted, combining individual two-sided *P*-values is generally not appropriate in meta-analysis, where the same hypothesis is tested in all studies. Combination of two-sided *P*-values is more appropriate when the individual tests are concerned with separate hypotheses. A small combined *P*-value in that case can be interpreted as evidence that one or more of the individual null hypotheses are false. Owing to its sensitivity to small *P*-values, the weighted Fisher’s method would provide good power in this setting, especially when there is pronounced heterogeneity of effect sizes between studies.

This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences. I wish to express my appreciation for comments and suggestions by Professors Michael Whitlock and Allen Moore, and by an anonymous reviewer.

- Box GEP. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the One-Way classification. The Annals of Mathematical Statistics. 1954;25:290–302.
- Chen Z. Is the weighted z-test the best method for combining probabilities from independent tests? Journal of Evolutionary Biology. 2011;24:926–930. [PubMed]
- Duchesne P, Lafaye De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Computational Statistics & Data Analysis. 2010;54:858–862.
- Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50:1096–1121.
- Farebrother RW. Algorithm AS 204: The distribution of a positive linear combination of *χ*^{2} random variables. Applied Statistics. 1984;33:332–339.
- Fisher R. Statistical methods for research workers. Oliver and Boyd; Edinburgh: 1932.
- Lipták T. On the combination of independent tests. Magyar Tud Akad Mat Kutato Int Közl. 1958;3:171–196.
- Stouffer S, DeVinney L, Suchmen E. The American soldier: Adjustment during army life. Vol. 1. Princeton University Press; Princeton, US: 1949.
- Tukey J. Exploratory data analysis. Addison-Wesley; Boston, Massachusetts, US: 1977.
- Whitlock MC. Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. Journal of Evolutionary Biology. 2005;18:1368–1373. [PubMed]
- Wilkinson B. A statistical consideration in psychological research. Psychological Bulletin. 1951;48:156–158. [PubMed]
- Won S, Morris N, Lu Q, Elston R. Choosing an optimal method to combine P-values. Statistics in medicine. 2009;28:1537–1553. [PMC free article] [PubMed]
- Zaykin D, Zhivotovsky L, Westfall P, Weir B. Truncated product method for combining P-values. Genetic Epidemiology. 2002;22:170–185. [PubMed]
