We studied the detection probability (DP) of a two-stage GWAS design, that is, the chance that a given disease-associated SNP will have among the lowest ranks of p-values (or highest ranks of chi-square statistics) at stages 1 and 2. Our data for fixed effects models indicate that the DP from a two-stage design with
πsample ≤0.25 and 8000 cases and controls can be substantially less than that of the corresponding one-stage design with the same numbers of cases and controls for odds ratios per allele of 1.1, 1.2, and 1.3, which are typical of statistically significant odds ratios found in recent large GWASs. For the range of values
T1≤25,000 that we studied, a “joint” analysis cannot appreciably increase the DP of the two-stage design if
πsample ≤0.25. Similar results are found for corresponding random effects disease models, which yield somewhat smaller DPs. To achieve an adequate DP, the first stage must have enough cases and controls to assure that a high proportion of disease-associated SNPs have among the
T1 lowest p-values at stage 1. Thus, if 16,000 cases and controls were available for study, choosing
πsample ≤0.25 would yield acceptable DP, as seen from calculations for the one-stage design (
Gail et al., 2008). Except for settings where enormous numbers of cases and controls are available for study, however, designs with
πsample ≤0.25 should be avoided. Software is available from the first author to allow researchers to study other parameter values and sample sizes.
Our data suggest that additional stage 1 genotyping in most previous studies with
πsample ≤0.25 will yield additional promising SNPs and that future multistage designs should not use
πsample ≤0.25, unless the numbers of available cases and controls are very large. Other considerations also favor using larger values of
πsample. As the cost per genotype of chips designed for stage 1 decreases relative to that for specialized platforms for subsequent stages, cost considerations argue for larger values of
πsample (
Skol, 2007,
Wang et al., 2006). The costs of obtaining cases and controls also favor a larger value of
πsample (
Müller et al., 2007).
The two-stage ranking and selection procedure analyzed in this paper differs from two-stage procedures that apply the same fixed critical values to data from each SNP and are designed to select promising SNPs in stage 1 and provide a final p-value for testing an association following stage 2, as in (
Skol et al., 2006). In particular, the two-stage ranking and selection procedure does not attempt to control the overall p-value, but only to obtain a very promising set of
T2 SNPs at the end of stage 2. Despite these different goals and methods, Figure 2 in (
Skol et al., 2006) shows that power diminishes appreciably, and cannot be retrieved by joint analysis, if
πsample ≤ 0.20 and πmarker = T1/T0 ≤ 0.1, in agreement with our findings for DP.
In some circumstances, power calculations can be used to approximate DP. For a one-stage design with
M0 = 1, equation (2.6) in (
Gail et al., 2008) shows that DP can be approximated by the power that corresponds to a hypothesis test with size α = T1/T0. Although power calculations performed in this way and extended to the two-stage design may approximate the DP under certain conditions, the results in (
Skol et al., 2006) were not based on such significance levels and critical values. We illustrate these differences using the program provided by (
Skol et al., 2006) at
http://csg.sph.umich.edu. For
πsample = 0.125, πmarker = T1/T0 = 25,000/500,000 = 0.05, genome-wide false-positive rate 0.05, which corresponds to α = 0.05/500,000 = 10⁻⁷ for the joint analysis, a single fixed minor allele frequency of 0.2673 for all SNPs, and an assumed disease prevalence of 0.10, this program yields power estimates of 0.83 for the replication analysis and 0.83 for the joint analysis. The corresponding critical value for a normal deviate for stage 1 is 1.96, and for stage 2, the critical values are 4.611 for the replication analysis and 5.189 for the joint analysis. For the ranking procedure with
M0 = 1,
T1 = 25,000 and
T2 = 1, 10, or 100, the DP was 0.63, 0.65, and 0.66 respectively for the replication analysis and 0.64, 0.66 and 0.66 respectively for the joint analysis () in the realistic setting in which minor allele frequencies are drawn from the distribution in CGEMS, with mean 0.2673. If instead it was assumed that all SNPs had the same minor allele frequency, 0.2673, as was assumed for the power calculations, the corresponding estimates of DP were 0.74, 0.75 and 0.75 for the joint analysis. These calculations illustrate that there are quantitative differences between assessments based on power and those based on detection probability, even though the broad conclusion that
πsample should not be too small is supported by both analyses.
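The one-stage approximation mentioned above, in which DP is approximated by the power of a test of size α = T1/T0, can be illustrated numerically. The following Python sketch is illustrative only. The function names are hypothetical, and the noncentrality formula (a two-sample comparison of allele frequencies under a per-allele odds ratio, with the control allele frequency set equal to the minor allele frequency) is an assumption of this sketch; it is not the computation used in the program of Skol et al. (2006) or in Gail et al. (2008).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def critical_value(alpha):
    """Two-sided critical value for a normal deviate at size alpha."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def case_allele_freq(maf, odds_ratio):
    """Allele frequency in cases implied by a per-allele odds ratio
    (assumption: control allele frequency equals maf)."""
    odds = odds_ratio * maf / (1.0 - maf)
    return odds / (1.0 + odds)


def noncentrality(maf, odds_ratio, n_cases, n_controls):
    """Approximate mean of the normal deviate comparing allele
    frequencies in cases and controls (2 alleles per subject)."""
    p0 = maf
    p1 = case_allele_freq(maf, odds_ratio)
    var = p1 * (1.0 - p1) / (2 * n_cases) + p0 * (1.0 - p0) / (2 * n_controls)
    return (p1 - p0) / var ** 0.5


def two_sided_power(mu, alpha):
    """Power of a two-sided normal test of size alpha when the deviate has mean mu."""
    z = critical_value(alpha)
    return STD_NORMAL.cdf(mu - z) + STD_NORMAL.cdf(-mu - z)


def approx_one_stage_dp(maf, odds_ratio, n_cases, n_controls, t1, t0):
    """Approximate one-stage DP by the power of a test of size alpha = t1 / t0."""
    mu = noncentrality(maf, odds_ratio, n_cases, n_controls)
    return two_sided_power(mu, t1 / t0)
```

For instance, `critical_value(25_000 / 500_000)` is approximately 1.96, the stage 1 critical value quoted above. The approximation treats each SNP in isolation, so it cannot reflect competition among disease-associated SNPs or variation in minor allele frequency across SNPs.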
It is worthwhile to recount some differences between power and detection probability. Power is the probability that the test statistic for a given SNP will fall into the pre-determined critical region for a one- or two-stage design that is chosen to control a genome-wide significance level, as for example in (
Skol et al., 2006). Power thus depends on the chosen significance level; DP depends, instead, on
T0,
T1, and
T2. The power to reject the null hypothesis for a given SNP does not depend on the test statistics for any other SNP; DP depends on the test results for all SNPs. Power does not depend on the number of disease-associated SNPs,
M0; DP can be sharply reduced by competition among disease-associated SNPs, especially if
T2 is less than
M0. Most power calculations assume that all disease-associated SNPs have the same minor allele frequency; DP calculations routinely allow minor allele frequencies to be drawn from a realistic distribution. DP estimates the probability that a given disease-associated SNP will have among the smallest
T2 p-values at the end of a two-stage study; power, as routinely calculated, does not have this interpretation. If disease-associated SNPs are exchangeable, as we assume in the fixed effect and random effect models (see METHODS), DP also has an interpretation as the proportion of disease-associated SNPs that have among the smallest
T2 p-values at the end of a two-stage study. Thus, the estimate of DP can be used to estimate how many more disease-associated SNPs with similar odds ratios to those already found are likely to be discovered by conducting another study with a larger stage 1 sample.
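The dependence of DP on the statistics of all SNPs, and on competition among disease-associated SNPs, can be seen in a small Monte Carlo sketch of a two-stage ranking procedure. This is a simplified illustration and not the procedure or software used in this paper: it assumes independent SNPs, normal deviates with unit variance, a common effect `mu` (the mean of the full-sample deviate) for every disease-associated SNP, and a joint statistic formed by weighting the stage deviates by the square roots of the sample fractions. The function name and all parameter names are hypothetical.

```python
import random


def simulate_dp(mu, m0, t0, t1, t2, pi_sample, joint=False, n_rep=200, seed=0):
    """Monte Carlo estimate of DP for a two-stage ranking design.

    SNPs 0..m0-1 are disease-associated; the remaining t0 - m0 are null.
    Stage 1 uses a fraction pi_sample of the subjects and keeps the t1 SNPs
    with the largest squared deviates; stage 2 ranks those t1 SNPs and keeps t2.
    """
    rng = random.Random(seed)
    w1 = pi_sample ** 0.5          # weight of the stage 1 deviate
    w2 = (1.0 - pi_sample) ** 0.5  # weight of the stage 2 deviate
    detected = 0
    for _ in range(n_rep):
        # Stage 1 deviates for all t0 SNPs.
        z1 = [rng.gauss(mu * w1 if i < m0 else 0.0, 1.0) for i in range(t0)]
        stage1 = sorted(range(t0), key=lambda i: z1[i] ** 2, reverse=True)[:t1]
        # Stage 2 deviates only for the SNPs carried forward.
        final_stat = {}
        for i in stage1:
            z2 = rng.gauss(mu * w2 if i < m0 else 0.0, 1.0)
            z = w1 * z1[i] + w2 * z2 if joint else z2
            final_stat[i] = z * z
        top = sorted(final_stat, key=final_stat.get, reverse=True)[:t2]
        detected += sum(1 for i in top if i < m0)
    # Proportion of disease-associated SNPs that end among the top t2.
    return detected / (n_rep * m0)
```

Under these assumptions, when T2 is less than M0 the estimate cannot exceed T2/M0 however large the effect, which is the competition effect described above; a power calculation for a single SNP has no counterpart to this ceiling.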
Satagopan and colleagues (
Satagopan et al., 2004,
Satagopan et al., 2002) also studied ranking procedures to identify disease-susceptibility SNPs in two-stage designs, but used different rank-based criteria from DP and also assumed that the disease allele was known. The two-sided versions of a Wald test or a score test that we used have the same value whether one counts major or minor alleles, and hence are particularly appropriate for GWASs (
Devlin & Roeder, 1999,
Pfeiffer & Gail, 2003).
The ranking and selection methods used in this paper depend on the assumption that tagging SNPs are independent (
Gail et al., 2008), an assumption that is also widely used in power calculations, e.g. (
Skol et al., 2006). Zaykin and Zhivotovsky (
Zaykin & Zhivotovsky, 2005) analyzed different ranking criteria for one-stage designs and found that selection probabilities were little affected by correlations of p-values within linkage disequilibrium blocks or among such blocks. In unreported simulations in which non-disease-associated SNPs were paired and each member of the pair assigned the same chi-square value (perfect correlation within pairs), we found almost no effect on the estimates of DP and PP, compared to the situation in which SNP genotypes are independent. Thus it is likely that our estimates of DP are robust to local correlations among tagging SNPs.
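The pairing scheme described above can be mimicked in a few lines. The sketch below is a hypothetical one-stage simplification, not the unreported simulation itself: it assumes a single disease-associated SNP whose normal deviate has mean `mu`, null chi-square statistics generated as squared standard normal deviates, and, when `paired_nulls` is true, each null statistic duplicated so that null SNPs are perfectly correlated within pairs.

```python
import random


def one_stage_dp(mu, t0, t1, paired_nulls=False, n_rep=200, seed=0):
    """Estimate the chance that one disease SNP ranks among the top t1 of t0 SNPs."""
    rng = random.Random(seed)
    n_null = t0 - 1
    hits = 0
    for _ in range(n_rep):
        disease_stat = rng.gauss(mu, 1.0) ** 2
        if paired_nulls:
            # Draw half as many statistics and use each one twice.
            half = [rng.gauss(0.0, 1.0) ** 2 for _ in range((n_null + 1) // 2)]
            nulls = (half + half)[:n_null]
        else:
            nulls = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n_null)]
        # The disease SNP is detected if fewer than t1 null SNPs outrank it.
        n_larger = sum(1 for s in nulls if s > disease_stat)
        hits += n_larger < t1
    return hits / n_rep
```

Calling the function with and without `paired_nulls=True` for the same `mu`, `t0`, and `t1` gives a rough sense of how sensitive the estimate is to this form of correlation in a given configuration.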
In view of the potential losses in DP in multistage designs and trends in costs favoring large values of πsample, the one-stage design becomes increasingly attractive. Another advantage of the one-stage design is that it yields data that can readily be used in meta-analyses. For example, if a preliminary study identifies a particular SNP as associated with disease, data from independent one-stage studies can be used to test the association and provide an unbiased estimate of the corresponding odds ratio. Later stages in a multistage design would provide no information if that SNP had not been tested in the later stages.