|Home | About | Journals | Submit | Contact Us | Français|
Large two-stage genome-wide association studies (GWASs) have been shown to reduce required genotyping with little loss of power, compared to a one-stage design, provided a substantial fraction of cases and controls,πsample, is included in stage 1. However, a number of recent GWASs have used πsample <0.2. Moreover, standard power calculations are not applicable because SNPs are selected in stage 1 by ranking their p-values, rather than comparing each SNP’s statistic to a fixed critical value. We define the detection probability (DP) of a two-stage design as the probability that a given disease-associated SNP will have among the lowest ranks of p-values at stage 1, and, among those SNPs selected at stage 1, at stage 2. For 8000 cases and 8000 controls available for study and for odds ratios per allele in the range 1.1–1.3, we show that DP is substantially reduced for designs with πsample ≤0.25, and that DP cannot be appreciably increased by analyzing the stage 1 and stage 2 data jointly. These results suggest that multistage designs with small first stages (e.g. πsample ≤0.25) should be avoided, and that additional genotyping in earlier studies with small first stages will yield previously unselected disease-associated SNPs.
Two-stage genome-wide association studies (GWASs) have been advocated because they can reduce required genotyping with little loss in power(Satagopan & Elston, 2003, Satagopan et al., 2004, Skol et al., 2006). In stage 1 of these designs, all available SNPs are analyzed among a proportion, πsample, of the available cases and controls; only the SNPs judged to be promising in stage 1 are studied in the remaining cases and controls (stage 2). If the cost per genotype is lower in stage 1, larger πsample is required to minimize total cost(Skol, 2007, Wang et al., 2006). Likewise, accounting for the cost of obtaining the cases and controls leads to larger values of πsample (Müller et al., 2007). Optimal designs typically require πsample to be 0.5 or more. Figure 2 in(Skol et al., 2006) shows that power diminishes appreciably if πsample ≤ 0.20. Nonetheless, some investigators use πsample <0.2 and try to compensate by selecting a large number of SNPs in stage 1 for further study in later stages.
The most studied type of two-stage design is based on an hypothesis testing paradigm: a SNP is said to be disease-associated if its test statistic exceeds a fixed critical value c1 at stage 1 and if its test statistic at stage 2 (usually based on the combined stage 1 and stage 2 data) exceeds a fixed critical value c2 (Müller et al., 2007, Satagopan & Elston, 2003, Skol, 2007, Skol et al., 2006, Wang et al., 2006). This hypothesis testing approach not only identifies promising SNPs in the first stage, but also provides an overall p-value for assessing association with disease.
However, many studies are not conducted this way. Instead, a SNP is selected for further study in stage 1 if its p-value ranks among the T1 smallest p-values. Thus, selection of promising SNPs is not based on a fixed critical value but on ranking the SNPs against each other. This strategy was used in the Cancer Genetic Markers of Susceptibility Study (CGEMS) for prostate cancer (http://cgems.cancer.gov/about/replication_strategy.asp) and in studies of various other diseases(Broderick et al., 2007, Buch et al., 2007, van Es et al., 2007). In the CGEMS prostate cancer study, about n1 = 1200 cases and controls were studied in stage 1 (i.e. πsample = 0.14), and roughly T1 = 28,000 SNPs with the smallest p-values were selected for study in 7200 additional cases and controls in subsequent stages.
One approach following stage 1 is to continue ranking and selecting the SNPs based on data in stage 2. In particular, one selects a SNP as highly promising at the end of stage 2 if its p-value, which may be based on stage 2 data alone or on some combination of stage 1 and stage 2 data, ranks among the lowest T2 p-values from the T1 SNPs studied in stage 2. In this paper we describe the properties of this two-stage selection procedure, and in particular, we calculate the probability that a disease-associated SNP will be selected at the end of stage 2 (the “detection probability”) and the proportion of selected SNPs that are expected to be disease-associated (the “proportion positive”). If a two-stage procedure is designed to have a high detection probability and proportion positive, then one can expect that most disease-associated SNPs will be included in the T2 selected SNPs and that independent epidemiologic studies will confirm the association with disease in a good proportion of the T2 selected SNPs.
A second approach is to test for associations with disease among the SNPs selected in stage 1. In a study of gallstone disease(Buch et al., 2007), the authors used the independent stage 2 data alone to produce a p-value. In a study of amyotrophic lateral sclerosis(van Es et al., 2007), the authors ranked SNPs in stage 1 and culled SNPs further by requiring a nominal p-value <0.1 in stage 2 before assigning a p-value based on independent stage 3 data. To study genetic associations with colorectal cancer, investigators combined data from stages 1 and 2 to produce an overall p-value(Broderick et al., 2007). The statistical properties of these hybrid approaches have not been evaluated, although p-values that depend only on an independent final stage should have nominal significance levels.
In previous work(Gail et al., 2008), we studied ranking procedures for one-stage designs and defined the “detection probability” (DP) as the probability that a disease associated SNP would rank among the T1 largest test statistics (or T1 smallest p-values). For a two-stage design, we now define DP as the probability that a disease-associated SNP will rank among the T1 largest statistics (or T1 smallest p-values) at stage 1 and among the T2 largest statistics (or T2 smallest p-values) at stage 2. As in(Gail et al., 2008), we can also compute the expected proportion of the T2 selected SNPs that are truly disease-associated, namely the proportion positive (PP), under the assumption that the true number of disease-associated SNPs is known. We provide methods for assessing DP in the two-stage design, not only for a “replication analysis” that is based on the rankings in stage 2 data alone, but also for a final “joint” analysis(Skol et al., 2006) that combines data from stages 1 and 2. We show that for magnitudes of odds ratios commonly found in GWASs, the DP can be much lower for a two-stage design with πsample ≤0.25 than for a one-stage design with the same number of cases and controls, and that DP is hardly increased by an optimal joint analysis of stage 1 and stage 2 data combined when πsample ≤0.25. This result is analogous to findings for power in(Skol et al., 2006). Our calculations indicate how much the probability of selecting a disease-associated SNP can be increased by additional genotyping in such studies, and add to arguments in favor of one-stage designs. Finally we discuss implications for the design of GWASs.
A two-stage GWAS analyzes T0 (say 500,000) tagging SNPs in n1 cases and n1 controls (stage 1) and selects T1 promising SNPs for study in stage 2 in independent cases ( n2) and controls ( n2). Following stage 2, T2 SNPs are selected in the hope that they are associated with disease. The proportion of cases and controls in stage 1 is πsample n1/(n1+n2).
Assuming that the T0 tagging SNPs are in linkage equilibrium as in(Skol et al., 2006), we proved that SNP genotypes are independent not only in controls, but also in cases for a rare disease(Gail et al., 2008). We used these ideas and asymptotic theory for the Wald and score tests to develop efficient procedures for simulating the chi-square test statistics for the M0 disease-associated SNPs (“disease SNPs”) and the T0−M0 non-disease SNPs in stage 1. For independent stage 2 data, we now extend these methods to simulate independent chi-square tests for T1 SNPs, of which a random number, M1, are disease SNPs that were selected at stage 1, and the remaining T1−M1 are non-disease SNPs. We study odds ratios per allele of 1.1, 1.2, 1.3, and 1.5, but we use 1.2 in the following description. We consider two models(Gail et al., 2008) for disease SNPs. In the fixed effects model, the log odds ratio per disease allele is fixed at β = log (1.2) for each disease SNP. Thus the relative odds is 1.44 for a homozygote and 1.2 for a heterozygote. In the random effects model, β is drawn independently for each disease SNP from a normal distribution with mean zero and standard deviation τ = (π/2)1/2 log(1.2) ≈1.253log(1.2). This value of τ yields an expected absolute value of β of log(1.2). Under both models, β =0 for non-disease SNPs. One chi-square test is the squared Wald statistic, 2/ar (), for testing β =0 in a model for log odds of disease that equals an intercept plus β times the number of minor alleles(Gail et al., 2008). We also studied the corresponding squared score test(Armitage, 1955, Sasieni, 1997). Each of these chi-square tests has the same value whether the major or minor allele confers risk.
At stage 1, the T1 SNPs with the largest chi-square tests (or smallest p-values) are selected. We study two analytical approaches, “replication analysis” and “joint analysis,” similar to (Skol et al., 2006). For a given SNP whose test statistic was in the critical region in stage 1, Skol et al. called an hypothesis test based only on the stage 2 data a “replication analysis.” They used the term “joint analysis” for a final hypothesis test based a linear combination of the test statistics for stage 1 and stage 2 data. Analogously, we use the term “replication analysis” if final selection of the T2 SNPs from among the T1 SNPs selected in stage 1 depends only on their rankings derived from the independent stage 2 data. In “joint analysis”, for each of the T1 SNPs selected in stage 1, we compute λ C1 + (1−λ)C2 for λ = 0, 0.05, 0.10,…,1.0, where C1 and C2 are the chi-square statistics observed in stages 1 and 2 respectively. For each λ we estimate DP, and we present the maximal P over the 21 values ofλ, together with the corresponding λopt. If several values maximize DP, we define λopt as their average.
For the simulations, we assume that 8000 cases and 8000 controls are available to be apportioned between stages 1 and 2 with equal numbers of cases and controls in each stage. We present data for M0 = 1 and 10 and for πsample = 0.125, 0.25, 0.50, and 1.0, both for fixed effects and random effects models. For each odds ratio, we conducted 12 independent simulation studies, each with 10,000 replications, to estimate DP for each combination of M0, πsample =0.125, 0.25 and 0.50 and fixed and random effects models. We allow for realistic random variation in minor allele frequency (MAF). In each simulation and for each of the T0 = 500,000 SNPs, we randomly select a MAF from the distribution in controls in CGEMS (https://caintegrator.nci.nih.gov/cgems/), truncated to the interval [0.05, 0.5]; the mean MAF is 0.2673, which is corrected from 0.2763 in (Gail et al., 2008). For each SNP, a chi-square value is generated that depends on the MAF, β, and numbers of cases and controls(Gail et al., 2008). Letting I (m, ISIM) =1 if disease SNP m is among the T1 SNPs selected after stage 1 and among the T2 SNPs selected after stage 2 and 0 otherwise, we estimate(Gail et al., 2008) DP as
We note from the exchangeability of the disease SNPs that DP can be interpreted either as the probability that a particular disease SNP will be selected at stage 2 or as the proportion of disease SNPs selected at stage 2(Gail et al., 2008). The proportion positive, PP, can be estimated(Gail et al., 2008) as
We performed these simulations in GAUSS(Aptec Systems, 2005).
The following results are for the chi-square test based on the Wald statistic. Very similar results were found for the chi-square version of the score test (data not shown). We present detailed information for the fixed effects model with odds ratio 1.2 per allele in Table 1. With M0 =1, P was 0.882 or higher for the one-stage design (πsample =1.0), for various values of T1 (shown under T2 in Tables). For πsample =0.5, P was slightly reduced in the replication analysis, but the joint analysis yielded P closer to that of the one-stage design. The λopt ranged from 0.30 to 0.45, indicating that the optimal combination usually put more weight onC2, even when n1 = n2. For πsample = 0.25 or 0.125, P was considerably reduced for the replication analysis, and the joint analysis yielded little if any increase in P. For example, for πsample = 0.125, T1 =25,000 and T2 =10, P was 0.936 for the one-stage analysis, 0.657 for the replication analysis, and 0.658 for the joint analysis. λopt was smaller for πsample = 0.125 or 0.25 than for 0.50. In five of the six scenarios presented in Table 1 for M0 =1, λopt was smaller for T2 =100 than for T2 = 1 or 10. For M0 =10, similar results were obtained except for T2 =1. For T2 =1, all designs and analyses yielded a P near 0.100, because the ten disease SNPs competed against each other for selection. For T2 = 10 or 100, results for M0 =10 were very close to those for M0 =1.
The proportion positive, PP, can also be estimated for β = log(1.2) from Table 1 and equation (2) assuming M0 is known. For M 0 =1, T1 =25,000 and T2 =10, P =0.065 for πsample =0.125 and 0.093 for the one-stage design. If M0 =10, the corresponding P values are 0.654 and 0.898. For M 0 =1, T1 =25,000 and T2 =1, P =0.630 for πsample =0.125 and 0.882 for the one-stage design.
For the random effects model with τ = (π/2)1/2 log(1.2) ≈ 0.2284 and with M0 =1(Table 2), P ranged from 0.550 to 0.646 for the one-stage design. With πsample = 0.5, P was modestly reduced for the replication analysis, but only slightly reduced for the joint analysis. λopt ranged from 0.25 to 0.45. For πsample = 0.25 or 0.125, P was considerably reduced for the replication analysis, and the joint analysis yielded little if any increase in P. For example, for πsample =0.125, T1 =25,000 and T2 =10, P was 0.599 for the one-stage analysis, 0.465 for the replication analysis, and 0.465 for the joint analysis. λopt was smaller for πsample =0.125 or 0.25 than for 0.50, and λopt was usually smaller for T2 =100 than for T2 = 1 or 10. For M0 =10, results were very close to those for M0 =1 except when T2 =1. With T2 =1, the M0 =10 disease SNPs competed against each other, reducing P to near 0.100 for all designs and analyses. As for the fixed effects model, application of equation (2) shows that P increases with increasing M0.
To examine how much the DP of the two-stage design is reduced compared to the one-stage design for fixed effects models with odds ratios per allele of 1.1, 1.2, 1.3 and 1.5, we plotted P for the two-stage design and joint analysis against the P for the corresponding one-stage design with 8000 cases and 8000 controls for πsample =0.125 and M0 =1 (Figure 1). The bold loci correspond to T1=25,000 for T2=1, 10 or 100, and the unbolded loci correspond to T1=1,000 for T2=1, 10 or 100. The points on each locus correspond from left to right to odds ratios 1.1, 1.2, 1.3 and 1.5. The vertical distance from the equiangular line is the decrease in estimated DP from using the two-stage design. For example, for odds ratio 1.2, and for T1=25,000 and T2=100, the vertical distance is 0.966−0.664=0.302; for T1=1,000 and T2=100, the vertical distance is 0.966−0.269=0.697, as can also be seen from Table I. The absolute decrease in P for two-stage designs is substantial for odds ratios 1.1, 1.2 and 1.3, but diminishes for odds ratio 1.5, as both designs attain P approaching 1.0. Except for T2=1, very similar graphs are obtained for M0 =10 (not shown).
For the random effects model with M0 =1 and πsample =0.125, decreases in P for the two-stage design are appreciable for standard deviations of log odds ratios of (π/2)1/2 times log(1.1), log( 1.2), log(1.3), and log(1.5) (Figure 2). Note that compared to the corresponding fixed effects models, P is smaller for both the one-stage and two-stage designs. Except for T2=1, very similar graphs are obtained for M0 =10 (not shown).
For the fixed effects model with M0 =1 and πsample =0.25, the decreases in P for the two-stage design are smaller than for πsample =0.125 (compare Figures 3 and and1).1). Nonetheless, the decreases in P remain appreciable for odds ratios 1.1 and 1.2 (Figure 3). Except for T2=1, very similar graphs are obtained for M0 =10 (not shown).
For the random effects model with M0 =1 and πsample =0.25, the decreases in P for the two-stage model are smaller than for πsample =0.125 (compare Figures 4 and and2).2). The decreases in P are small for T1 =25,000, but, for T1 =1,000, remain appreciable for standard deviations of log odds ratios of (π/2)1/2 times log(1.1), log( 1.2), log(1.3), and log(1.5) (Figure 4). Except for T2=1, very similar graphs are obtained for M0 =10 (not shown).
An executable pre-complied GAUSS program is available from the first author to estimate DP and PP for two-stage GWASs.
We studied the detection probability (DP) of a two-stage GWAS design, that is, the chance that a given disease-associated SNP will have among the lowest ranks of p-values (or highest ranks of chi-square statistics) at stages 1 and 2. Our data for fixed effects models indicate that the DP from a two-stage design with πsample ≤0.25 and 8000 cases and controls can be substantially less than that of the corresponding one-stage design with the same numbers of cases and controls for odds ratios per allele of 1.1, 1.2, and 1.3, which are typical of statistically significant odds ratios found in recent large GWASs. For the range of values T1≤25,000 that we studied, a “joint” analysis cannot appreciably increase the DP of the two-stage design if πsample ≤0.25. Similar results are found for corresponding random effects disease models, which yield somewhat smaller DPs. To achieve an adequate DP, the first stage must have enough cases and controls to assure that a high proportion of disease-associated SNPs have among the T1 lowest p-values at stage 1. Thus, if 16,000 cases and controls were available for study, choosing πsample ≤0.25 would yield acceptable DP, as seen from calculations for the one-stage design(Gail et al., 2008). Except for settings where enormous numbers of cases and controls are available for study, however, designs with πsample ≤0.25 should be avoided. Software is available from the first author to allow researchers to study other parameter values and sample sizes.
Our data suggest that additional stage 1 genotyping in most previous studies with πsample ≤0.25 will yield additional promising SNPs and that future multistage designs should not use πsample ≤0.25, unless the numbers of available cases and controls are very large. Other considerations also favor using larger values of πsample. As the cost per genotype of chips designed for stage 1 decreases relative to that for specialized platforms for subsequent stages, cost considerations argue for larger values of πsample (Skol, 2007, Wang et al., 2006). The costs of obtaining cases and controls also favor a larger value of πsample (Müller et al., 2007).
The two-stage ranking and selection procedure analyzed in this paper differs from two-stage procedures that apply the same fixed critical values to data from each SNP and are designed to select promising SNPs in stage 1 and provide a final p-value for testing an association following stage 2, as in (Skol et al., 2006). In particular, the two-stage ranking and selection procedure does not attempt to control the overall p-value, but only to obtain a very promising set of T2 SNPs at the end of stage 2. Despite these different goals and methods, Figure 2 in (Skol et al., 2006) shows that power diminishes appreciably, and cannot be retrieved by joint analysis, if πsample ≤ 0.20 and πmar ker T1/T0 ≤ 0.1, in agreement with our findings for DP.
In some circumstances, power calculations can be used to approximate DP. For a one-stage design with M0 = 1, equation (2.6) in (Gail et al., 2008) shows that DP can be approximated by the power that corresponds to an hypothesis test with size α = T1/T0. Although power calculations performed in this way and extended to the two-stage design may approximate the DP under certain conditions, the results in (Skol et al., 2006) were not based on such significance levels and critical values. We illustrate these differences using the program provided by (Skol et al., 2006) at http://csg.sph.umich.edu. For πsample =0.125, πmar ker T1/T0 = 25, 000/500, 000 = 0.05, genome-wide false-positive rate 0.05, which corresponds to α = 0.05/500, 000 = 10−7 for the joint analysis, a single fixed minor allele frequency of 0.2673 for all SNPs, and an assumed disease prevalence of 0.10, this program yields power estimates of 0.83 for the replication analysis and 0.83 for the joint analysis. The corresponding critical value for a normal deviate for stage 1 is 1.96, and for stage 2, the critical values are 4.611 for the replication analysis and 5.189 for the joint analysis. For the ranking procedure with M0 =1, T1 = 25, 000 and T2 = 1, 10, or 100, the DP was 0.63, 0.65, and 0.66 respectively for the replication analysis and 0.64, 0.66 and 0.66 respectively for the joint analysis (Table 1) in the realistic setting in which minor allele frequencies are drawn from the distribution in CGEMS, with mean 0.2673. If instead it was assumed that all SNPs had the same minor allele frequency, 0.2673, as was assumed for the power calculations, the corresponding estimates of DP were 0.74, 0.75 and 0.75 for the joint analysis. These calculations illustrate that there are quantitative differences between assessments based on power and those based on detection probability, even though the broad conclusion that πsample should not be too small is supported by both analyses.
It is worthwhile to recount some differences between power and detection probability. Power is the probability that the test statistic for a given SNP will fall into the pre-determined critical region for a one- or two-stage design that is chosen to control a genome-wide significance level, as for example in (Skol et al., 2006). Power thus depends on the chosen significance level; DP depends, instead, on T0, T1, and T2. The power to reject the null hypothesis for a given SNP does not depend on the test statistics for any other SNP; DP depends on the test results for all SNPs. Power does not depend on the number of disease-associated SNPs, M 0; DP can be sharply reduced by competition among disease-associated SNPs, especially if T2 is less than M 0. Most power calculations assume that all disease-associated SNPs have the same minor allele frequency; DP calculations routinely allow for allele-frequencies to be drawn from a realistic distribution of allele minor frequencies. DP estimates the probability that a given disease-associated SNP will have among the smallest T2 p-values at the end of a two-stage study; power, as routinely calculated, does not have this interpretation. If disease-associated SNPs are exchangeable, as we assume in the fixed effect and random effect models (see METHODS), DP also has an interpretation as the proportion of disease-associated SNPs that have among the smallest T2 p-values at the end of a two-stage study. Thus, the estimate of DP can be used to estimate how many more disease-associated SNPs with similar odds ratios to those already found are likely to be discovered by conducting another study with a larger stage1 sample.
Satagopan and colleagues (Satagopan et al., 2004, Satagopan et al., 2002) also studied ranking procedures to identify disease-susceptibility SNPs in two-stage designs, but used different rank-based criteria from DP and also assumed that the disease allele was known. The two-sided versions of a Wald test or a score test that we used have the same value whether one counts major or minor alleles, and hence are particularly appropriate for GWASs(Devlin & Roeder, 1999, Pfeiffer & Gail, 2003).
The ranking and selection methods used in this paper depend on the assumption that tagging SNPs are independent (Gail et al., 2008), an assumption that is also widely used in power calculations, e.g. (Skol et al., 2006). Zaykin and Zhivotovsky (Zaykin & Zhivotovsky, 2005) analyzed different ranking criteria for one-stage designs and found that selection probabilities were little affected by correlations of p-values within linkage disequilibrium blocks or among such blocks. In unreported simulations in which non-disease associated SNPs were paired and each member of the pair assigned the same chi-square value (perfect correlation within pairs), we found almost no effect on the estimates of DP and PP, compared to the situation in which SNP gentotypes are independent. Thus it is likely that our estimates of DP are robust to local correlations among tagging SNPs.
In view of the potential losses in DP in multistage designs and trends in costs favoring large values of πsample, the one-stage design becomes increasingly attractive. Another advantage of the one-stage design is that it yields data that can readily be used in meta-analyses. For example, if a preliminary study identifies a particular SNP as associated with disease, data from independent one-stage studies can be used to test the association and provide an unbiased estimate of the corresponding odds ratio. Later stages in a multistage design would provide no information if that SNP had not been tested in the later stages.
The Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute supported this work. We thank the reviewers for comments that improved the paper.