Biostatistics. 2009 April; 10(2): 324–326.
Published online 2008 December 3.
PMCID: PMC2648901

# Optimal 2-stage design with given power in association studies

Jiexun Wang
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China and Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Hua Liang
Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642, USA

In genetic association studies, the 2-stage design has recently received much attention as a cost-effective design. In this note, we focus on the 2-stage design in which DNA pooling is used in the first stage and individual genotyping is used in the second stage. An important problem with such a design is how to optimize it. Zuo and others (2008) investigated this problem under a given cost. The objective of this note is to solve the problem under the requirement of a given statistical power. In practical applications, the sample in the first stage can be reused in the second stage; such a 2-stage scheme is called “the 2-stage dependent design”. Alternatively, we may use 2 separate samples in the 2 stages, with one sample used for screening and the other for confirmation. Such a 2-stage scheme is called “the 2-stage independent design” (Zuo and others, 2006). We consider how to optimize the parameters in these 2 kinds of 2-stage design so that the total cost of the study is minimized when a given power is required. As mentioned by Satagopan and Elston (2003), this task can be completed by minimizing the ratio of the cost of the 2-stage design to that of the 1-stage design using individual genotyping, where the overall significance levels and the total sample sizes are the same for the 2 designs, and their powers are as close as possible.

Using the notation in Zuo and others (2008), the cost functions for the 2-stage dependent and independent designs can be expressed as

and

respectively. For the 1-stage design with individual genotyping, the cost function is given by

where $N$ is the sample size attaining the desired power of $1-\beta$ with an overall significance level of $\alpha$. Thus, when the total sample size of the 1-stage design equals that of the 2-stage design, the goal of minimizing $T_{2,De}/T_1$ (or $T_{2,In}/T_1$) is equivalent to minimizing $ST_{De}/ST_1 \equiv \omega_{De}$ (or $ST_{In}/ST_1 \equiv \omega_{In}$) for the 2-stage dependent (or independent) design, where

with $r=C_{pool}/C_{ind}$.
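As a rough illustration of how a 1-stage sample size $N$ attaining power $1-\beta$ at level $\alpha$ might be computed, the following minimal sketch uses the textbook normal approximation for a two-sided comparison of two allele frequencies. It is not the calculation procedure of this note: the function name `one_stage_sample_size`, the equal numbers of cases and controls, the absence of genotyping error, and the use of a single per-test significance level (with no adjustment for the number of markers) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def one_stage_sample_size(p_case, p_control, alpha=0.05, power=0.8):
    """Approximate number of cases (equal to the number of controls).

    Textbook two-proportion z-test approximation on allele counts.
    Hypothetical helper, not the procedure used in the note.
    """
    if p_case == p_control:
        raise ValueError("allele frequencies must differ")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for power 1 - beta
    # Sum of the binomial variances of one allele in each group.
    variance = p_case * (1 - p_case) + p_control * (1 - p_control)
    alleles_per_group = (z_alpha + z_beta) ** 2 * variance / (p_case - p_control) ** 2
    # Each individual contributes 2 alleles.
    return math.ceil(alleles_per_group / 2)


# Example: control allele frequency 0.05, case allele frequency 0.10.
n_per_group = one_stage_sample_size(0.10, 0.05)
```

Under this approximation the required sample size grows as the allele frequency difference shrinks or as the target power rises. For a study with many markers, a corrected per-marker level would be passed as `alpha`.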

The constraints on the powers of the 2-stage dependent and independent designs (denoted by $P_{De}$ and $P_{In}$) are $(1-\beta)-P_{De}\le e_{De}$ and $(1-\beta)-P_{In}\le e_{In}$, respectively, where $e_{De}$ and $e_{In}$ are small numbers such as 0.01 and 0.03.
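The optimization can be pictured as a constrained search: among candidate 2-stage designs, keep those whose power falls short of $1-\beta$ by no more than the tolerance, then take the one with the smallest cost ratio. The sketch below shows only this selection step. The `Candidate` class, the `pick_optimal` function, and all numbers are hypothetical; in practice the power and cost ratio of each candidate would come from the design's own formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    power: float       # achieved power P of the 2-stage design
    cost_ratio: float  # omega: cost relative to the 1-stage design


def pick_optimal(candidates, target_power=0.8, tolerance=0.01):
    """Return the cheapest candidate with (1 - beta) - P <= tolerance.

    Returns None when no candidate meets the power constraint.
    """
    feasible = [c for c in candidates if target_power - c.power <= tolerance]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.cost_ratio)


# Made-up candidates, for illustration only.
candidates = [
    Candidate("a", power=0.770, cost_ratio=0.20),  # cheapest, but power too low
    Candidate("b", power=0.795, cost_ratio=0.35),
    Candidate("c", power=0.805, cost_ratio=0.50),
]
best = pick_optimal(candidates, target_power=0.8, tolerance=0.01)
print(best.label)  # prints "b"
```

Candidate "a" is rejected because its power shortfall of 0.03 exceeds the tolerance of 0.01, even though it has the lowest cost ratio. Relaxing the tolerance to 0.05 would make it the selected design.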

To obtain the optimal choices of the parameters in the 2-stage design with a desired power, we use a calculation procedure provided in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). We consider a population frequency of allele A of $p=0.05$, 0.2, or 0.7 and an allele frequency difference between the cases and controls of $p_A-p_U=0.05$ or 0.10, and we assume that the overall significance level is $\alpha=0.05$ and the power of the 1-stage design is $1-\beta=0.8$. We set the total number of markers to $M=25$, 500, or $10^6$ and the number of true disease markers to $K=1$ or 5, and let $r=C_{pool}/C_{ind}=1.5$, $e_{De}=0.01$, and $e_{In}=0.03$.
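The parameter settings above define a full grid of scenarios, which can be enumerated as in the following sketch. The variable names are illustrative, and the largest marker count is taken to be $10^6$, which is an assumption about the intended value. The sketch only lists the scenarios; it does not perform the optimization itself.

```python
from itertools import product

# Values that vary across scenarios.
allele_freqs = [0.05, 0.2, 0.7]      # p: population frequency of allele A
freq_diffs = [0.05, 0.10]            # allele frequency difference, cases vs controls
marker_counts = [25, 500, 10**6]     # M: total markers (10**6 is an assumed reading)
disease_markers = [1, 5]             # K: true disease markers

# Values held fixed in every scenario.
fixed = {"alpha": 0.05, "power": 0.8, "r": 1.5, "e_De": 0.01, "e_In": 0.03}

scenarios = [
    {"p": p, "diff": d, "M": m, "K": k, **fixed}
    for p, d, m, k in product(allele_freqs, freq_diffs, marker_counts, disease_markers)
]

print(len(scenarios))  # prints 36 (3 * 2 * 3 * 2 combinations)
```

Each dictionary in `scenarios` is one configuration for which an optimal dependent and an optimal independent design would be sought.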

Our calculations show that the 2-stage dependent design yields a large cost saving, especially when the total number of markers is large. Genotyping errors at common error rates have little effect on the cost saving, although the saving increases slightly as the genotyping error rate increases. In contrast, measurement errors in DNA pooling have a large effect on the optimal 2-stage dependent design; forming multiple pools reduces this effect substantially. For the 2-stage independent design, the cost saving depends largely on the measurement error rates in the first stage. At the usual error rates for DNA pooling, the optimal design tends toward the 1-stage individual genotyping design, in which case there is essentially no cost saving. Also, unlike in the 2-stage dependent design, forming multiple pools does not necessarily increase the cost saving for the 2-stage independent design. However, when the measurement error rates are very small, the optimal design tends toward the 1-stage DNA pooling design, in which case the cost saving can be substantial.

Comparing the 2-stage dependent and independent designs, the dependent design offers the larger cost saving. Figure 1 makes this clear.

Figure 1. Cost comparison for the 1-stage design and the optimal 2-stage dependent and independent designs at different powers, for a population frequency of allele A of $p=0.05$ and an allele frequency difference between the cases and controls of $p_A-$ ...

## FUNDING

National Institutes of Health (AI62247-01 and AI59773) to H.L.; National Natural Science Foundation of China (70625004, 10721101, and 70221001) to G.Z.

## Acknowledgments

The authors are grateful to the referee for the insightful comments and suggestions. Conflict of Interest: None declared.

## References

• Satagopan J M, Elston R C. Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology. 2003;25:149–157.
• Zuo Y, Zou G, Wang J, Zhao H, Liang H. Optimal two-stage design for case-control association analysis incorporating genotyping errors. Annals of Human Genetics. 2008;72:375–387.
• Zuo Y, Zou G, Zhao H. Two-stage designs in case-control association analysis. Genetics. 2006;173:1747–1760.

