‘Big’ bioscience is critically poised. It is now known that genetic associations with complex diseases can reliably be detected and replicated if sample sizes are large enough. This will fuel international investment in biobanking. But, how far should that investment go?
It is essential to close the ‘reality-gap’ that currently exists between the sample sizes really required to detect determinants of scientific interest that have plausible bioclinical effects, and the sample sizes that are typically used when studies are being designed. Extensive theoretical work has been undertaken to explore statistical power and sample size in human genomics. This includes studies of the effect of genotype misclassification on power,59–62
and of strategies for power optimization for genetic main effects63–64
and for gene–gene65
and gene–environment interactions.66
Furthermore, the effect of measurement error in both outcomes and exposures on the statistical power of gene–environment interaction studies has been explored thoroughly for quantitative traits.
But, previous work on the power of case–control analyses (i.e. binary traits
) has not addressed the impact of realistic assessment errors in both exposures and outcomes and the impact of unmeasured aetiological determinants. The important original contributions of the current article are therefore threefold: (i) to extend the classes of analytic complexity addressed by a straightforward simulation-based power calculation engine; (ii) to use this calculator to undertake realistic
sample size calculations for a class of analyses (case–control analyses with unavoidable assessment errors in both exposures and outcomes) that will be utilized widely over the next few decades—analyses in this class are the least powerful that are likely to be applied commonly, and the resultant calculations, therefore, provide a valuable guide to study design in the many large-scale biobanks that are currently being conceived and launched; (iii) to alert readers, particularly those setting up new biobanks, to a web-based implementation—ESPRESSO (Estimating Sample-size and Power in R by Exploring Simulated Study Outcomes)—of the power calculator used in this article and to provide detailed information (in Supplementary methods
) about the mathematical models on which it is based.
Studies enrolling several hundred subjects are commonplace in human genome epidemiology. But, even conventional power calculations67
indicate that 400 cases and 400 controls provide <1% power to detect (at P
< 0.0001) an OR of 1.4 for a binary ‘at risk’ genetic variant with a general population frequency of 0.0975 (e.g. a dominant risk-determining allele with MAF = 0.05). There can be no doubt that a study involving several hundred cases and controls demands hard work and is large by historical comparison; nevertheless, the reality
is that to generate a power of 80%, such a study would actually require 4000 cases and 4000 controls.
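For illustration, a conventional calculation of this kind can be reproduced with a normal approximation to the log odds ratio. The following sketch is not ESPRESSO itself; it simply encodes the parameters quoted above (carrier frequency 0.0975, OR 1.4, significance threshold P < 0.0001) in a standard two-group approximation.

```python
# Conventional power for a case-control analysis of a binary 'at risk'
# genotype, via a normal approximation to the log odds ratio. Illustrative
# sketch using the parameters quoted in the text; not the ESPRESSO engine.
import math
from scipy.stats import norm

def cc_power(n_cases, n_controls, p_exposed_controls, odds_ratio, alpha):
    p0 = p_exposed_controls
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)                       # exposure frequency in cases
    var_log_or = (1 / (n_cases * p1 * (1 - p1)) +
                  1 / (n_controls * p0 * (1 - p0)))
    z_crit = norm.ppf(1 - alpha / 2)               # two-sided threshold
    return norm.cdf(abs(math.log(odds_ratio)) / math.sqrt(var_log_or) - z_crit)

# Dominant model with MAF = 0.05: carrier frequency = 1 - 0.95**2 = 0.0975
print(cc_power(400, 400, 0.0975, 1.4, 1e-4))       # well under 1% power
print(cc_power(4000, 4000, 0.0975, 1.4, 1e-4))     # approximately 80% power
```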
But even these figures substantially understate the challenge that really faces us. Conventional power calculations ignore many aspects of analytic complexity. Using ESPRESSO, the R-based55
simulation-based power calculator68
jointly developed by P3G, PHOEBE and UK Biobank (www.p3gobservatory.org/powercalculator.htm
), such complexities can be taken into proper account. Using this approach to mimic the conventional power calculation (above)—i.e. assuming disease and genotype to be assessed without error and no heterogeneity in disease risk—confirms a requirement for approximately 4000 cases and 4000 controls. But the sensitivity and specificity of the diagnostic test ought to be taken into proper account: e.g. 0.891 and 0.974, respectively, for a published screening test for type 2 diabetes based on glycosylated haemoglobin56
(see Supplementary box 2
). Genotyping error must also be considered. It may, for example, be reasonable to assume that such error arises through incomplete LD (quantified by R2) between the genotyped marker and the causative variant.
Finally, heterogeneity in disease risk might be reflected in an assumed 10-fold ratio in risk between subjects on the high (95%) and low (5%) centiles of population risk. Once these assumptions are built in, the required sample size more than doubles
to 8500 cases and 8500 controls.
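The kind of calculation ESPRESSO performs can be sketched, in simplified form, as follows. This is an illustrative re-implementation rather than the ESPRESSO code itself: the diagnostic sensitivity and specificity and the 10-fold risk ratio are the values quoted above, whereas the baseline prevalence, population size, the form of the risk heterogeneity (a normal frailty on the log-odds scale) and the number of simulation replicates are assumptions made purely for the sketch; genotyping error (e.g. through incomplete LD) is omitted for brevity.

```python
# Simplified simulation-based power calculation in the spirit of ESPRESSO.
# Illustrative sketch only: sensitivity/specificity and the 10-fold risk
# ratio follow the text; prevalence, population size and replicate count
# are arbitrary assumptions made for this sketch.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def one_replicate(n_cases, n_controls, p_geno=0.0975, or_geno=1.4,
                  sens=0.891, spec=0.974, risk_ratio_95_5=10.0,
                  baseline_prev=0.08, pop_size=200_000, alpha=1e-4):
    # Binary 'at risk' genotype (carrier of the risk allele)
    g = rng.random(pop_size) < p_geno
    # Unmeasured heterogeneity: normal frailty on the log-odds scale, scaled
    # so that the 95th:5th centile ratio of risk is risk_ratio_95_5
    sd = np.log(risk_ratio_95_5) / (2 * norm.ppf(0.95))
    frailty = rng.normal(0.0, sd, pop_size)
    # True disease status from a logistic model
    lin = np.log(baseline_prev / (1 - baseline_prev)) + np.log(or_geno) * g + frailty
    disease = rng.random(pop_size) < 1.0 / (1.0 + np.exp(-lin))
    # Observed (error-prone) phenotype via diagnostic sensitivity/specificity
    observed = np.where(disease,
                        rng.random(pop_size) < sens,
                        rng.random(pop_size) < (1.0 - spec))
    # Case-control sample drawn on the basis of the *observed* phenotype
    cases = rng.choice(np.flatnonzero(observed), n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(~observed), n_controls, replace=False)
    # Wald test on the log odds ratio from the resulting 2 x 2 table
    a = g[cases].sum();    b = n_cases - a
    c = g[controls].sum(); d = n_controls - c
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    return abs(log_or / se) > norm.ppf(1 - alpha / 2)

power = np.mean([one_replicate(4000, 4000) for _ in range(200)])
print(f"Estimated power under error-prone phenotyping: {power:.2f}")
```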
It might be argued that substantial power could be gained if a multiplicative model based on additive allelic effects [Supplementary methods
, equations (H and I)], as in WTCCC,14
were used instead of a binary genetic model (Supplementary box S1
). Statistical power will be increased if there is a systematic gradation in the strength of association across the three genotypes defined by two alleles. This may reflect biological reality, or it may arise as an artefact of the decay of incomplete LD when working with a linked marker rather than a causative variant. But (Supplementary Figure S5
), the reduction in required sample size (typically, 5–50%) is only substantial for SNPs with a common minor allele. This is because when the minor allele is rare, subjects homozygous for the minor allele will be very rare
and the genetic determinant will effectively act as if it were a binary exposure. But, power limitation is less of a problem for SNPs with a common minor allele and so the impact of moving to a valid multiplicative genetic model is less dramatic than might otherwise be assumed.
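The reason the gain is small for rare variants can be seen directly from Hardy–Weinberg genotype frequencies; the short sketch below simply tabulates them for a few illustrative MAFs.

```python
# Under Hardy-Weinberg equilibrium, the minor-allele homozygotes that
# distinguish a multiplicative (log-additive) model from a simple binary
# 'carrier' model are vanishingly rare when the MAF is small.
for maf in (0.05, 0.10, 0.20, 0.40):
    het = 2 * maf * (1 - maf)          # heterozygote frequency
    hom = maf ** 2                     # minor-allele homozygote frequency
    print(f"MAF = {maf:.2f}: heterozygotes {het:.3f}, "
          f"minor-allele homozygotes {hom:.4f}")
# At MAF = 0.05 only 0.25% of subjects are minor-allele homozygotes, so the
# variant behaves almost exactly like a binary exposure.
```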
One of the landmark genomic studies of 2007 was the WTCCC that reported robust ‘hits’ in seven of eight complex diseases in its main experiment.14
But the basic design—involving 2000 cases and 3000 controls for each disease—seems, at first sight, to be at the lower limit of the required sample sizes implied by our calculations. Therefore, it is tempting to conclude either that the WTCCC was lucky or that our calculations are overly conservative. But, the main experiment of WTCCC had a number of design features that contrast with the assumptions of the primary power calculations reported in our article.14
The most relevant of these are: (i) use of a model invoking an additive genetic effect rather than a binary ‘at-risk’ genotype; (ii) cases rigorously phenotyped so that few, if any, non-diseased subjects will have appeared as cases; (iii) a P
-value threshold of 5 × 10⁻⁷
; (iv) a case : control ratio of 2 : 3; (v) no phenotyping of controls so diseased subjects will have contaminated the controls to an extent determined solely by general population prevalence. On the basis of simulations that invoke all of these assumptions, Supplementary Figure S6
presents the precise equivalent of the results described above, but uses the design parameters of the WTCCC. On the basis of their own simulation-based power calculations (incorporating errors consequent upon incomplete LD),14
the design team of the WTCCC estimated that its power would be ‘43% for alleles with a relative risk of 1.3, increasing to 80% for a relative risk of 1.5’.14
These power calculations were based on averaging across all MAFs > 0.05,14
and the design should, therefore, be underpowered for SNPs with an uncommon minor allele and have more power than required for common SNPs.14
Our methods (Supplementary Figure S6) concur that for SNPs with a MAF in the range 0.2–0.5, the WTCCC design was well powered to detect heterozygote ORs14
of 1.3 or greater and that even ORs as low as 1.2 should have been detected with a non-negligible probability (power ~9%). On the other hand, the power to detect rarer SNPs (MAF = 0.05–0.1) with ORs <1.5 should have been low. Without knowing which SNPs are truly associated with which complex disease, or how strong those associations might be, it is impossible to use the empirical evidence to precisely
quantify how accurately our approach predicts the power of the WTCCC. But, the predicted power profile is certainly consistent with the results reported in Table 3 of the WTCCC paper.14
Three of the 19 SNPs they identified as having a ‘significant heterozygous OR’ had a MAF between 0.05 and 0.1, and all three had an OR > 1.5. In contrast, 13 had a MAF between 0.2 and 0.5 and, of these, four exhibited an OR < 1.3 (1.19–1.29), five an OR between 1.3 and 1.5 and four an OR > 1.5. SNPs with a rarer minor allele are typically the most numerous,57
and if power was not a substantial issue, one would have expected more ‘hits’ to arise in rare SNPs. It is true that the observed ORs would have been subject to the ‘winner's curse’,69
but this does not detract from the consistency of the overall pattern that was found. As a second test of its validity, the ESPRESSO model was then used to estimate the power of the WTCCC to reconfirm the effect of 12 (non-HLA) loci that had been ‘previously robustly replicated’.14
On the basis of the published bioclinical characteristics of these 12 variants (Supplementary Table S5
), our simulations predicted a 24% probability that all 12 would replicate and probabilities of 44, 26 and 6%, respectively, that 11, 10 or ≤9 would replicate. These predictions are closely consistent with the published WTCCC analysis in which 10 of the 12 actually replicated.14
Of course, these analyses provide no more than a rudimentary check of the calibration of our approach; nevertheless, it is encouraging that the predictions appear sensible and it would, therefore, seem reasonable to apply the methods to new problems, including those involving environmental as well as genetic determinants.
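For interested readers, the arithmetic behind a replication prediction of this kind is straightforward: given a per-locus replication probability for each of the 12 variants, the number that replicate follows a Poisson–binomial distribution (assuming independence across loci). The per-locus values below are arbitrary placeholders, not the values actually estimated from Supplementary Table S5.

```python
# Distribution of the number of loci expected to replicate, given a per-locus
# power estimate for each one (a Poisson-binomial calculation, assuming
# independence across loci). The powers below are hypothetical placeholders;
# the real values would come from locus-specific simulations.
import numpy as np

per_locus_power = [0.95, 0.99, 0.90, 0.85, 0.98, 0.97,
                   0.80, 0.99, 0.93, 0.96, 0.88, 0.92]   # hypothetical

# dist[k] = probability that exactly k of the loci replicate
dist = np.array([1.0])
for p in per_locus_power:
    dist = np.convolve(dist, [1 - p, p])

print("P(all 12 replicate) =", dist[-1])
print("P(11 replicate)     =", dist[-2])
print("P(10 replicate)     =", dist[-3])
print("P(9 or fewer)       =", dist[:-3].sum())
```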
The fact that our power estimates appear consistent with those of the WTCCC team itself suggests that any additional elements of analytic complexity addressed by our methods had a limited impact on required sample size in this particular setting. We therefore explored the relative contributions to the increased sample size requirement made by those specific elements of our model that are not included in a conventional power calculation. Across an arbitrary, but not atypical, set of models incorporating a gene–environment interaction (see Supplementary materials and Supplementary Table S6
), a realistic level of error in assessing the environmental determinant was the most influential factor in inflating the required sample size. But, the WTCCC analysis focused solely on genetic main effects, and so this was irrelevant. Furthermore, all cases in WTCCC were carefully phenotyped. Specificity was, therefore, close to 100% and very few, if any, healthy subjects would have appeared as cases (Supplementary materials and Supplementary Figures S2a
and S2b). Finally, the sophisticated power calculations undertaken by WTCCC took appropriate account of error arising from incomplete LD, and so the only additional factor that did
come into play in the WTCCC was heterogeneity in underlying disease risk—but, on its own, this has little impact (Supplementary Table S6).
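The dominant role of exposure assessment error is easy to illustrate in a simpler, main-effect setting: non-differential misclassification of a binary environmental exposure attenuates the observable odds ratio, and the required sample size grows roughly as the square of the ratio of the true to the attenuated log odds ratio. The sensitivity, specificity and exposure prevalence below are illustrative assumptions, not the parameters used in Supplementary Table S6.

```python
# How non-differential error in assessing a binary environmental exposure
# attenuates the observed odds ratio, and hence inflates the sample size
# needed to detect it. All parameter values here are illustrative assumptions.
import math

def observed_or(true_or, p_exposed_controls, sens, spec):
    """Observed OR after non-differential misclassification of exposure."""
    p0 = p_exposed_controls
    odds1 = true_or * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)                     # true exposure freq in cases
    # apparent exposure frequencies after misclassification
    obs0 = p0 * sens + (1 - p0) * (1 - spec)
    obs1 = p1 * sens + (1 - p1) * (1 - spec)
    return (obs1 / (1 - obs1)) / (obs0 / (1 - obs0))

true_or = 1.5
obs = observed_or(true_or, p_exposed_controls=0.3, sens=0.8, spec=0.8)
print(f"true OR = {true_or}, observed OR = {obs:.2f}")
# Required sample size scales roughly with (log true OR / log observed OR)^2
inflation = (math.log(true_or) / math.log(obs)) ** 2
print(f"approximate sample-size inflation factor: {inflation:.1f}")
```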
Can biobanks ever be large enough?
Although our methods are in accord with the power calculations undertaken by the WTCCC and suggest that it was appropriately powered to detect the effects that it set out to study, larger—sometimes much
larger—sample sizes will be required to reliably detect: (i) ORs at the lower end of the plausible range; (ii) SNP effects associated with rarer minor alleles; (iii) genotypic effects that are binary rather than multiplicative in nature; (iv) gene–environment (or gene–gene) interactions; or (v) aetiological effects in case series subject to less exhaustive phenotyping. If bioscience aims to rigorously investigate such effects, it will be necessary to design studies enrolling not thousands, but tens of thousands of cases. But, studies of such a size should not be contemplated unless relative risks ≤1.5 are really worth investigating. A central aim of modern bioscience is to understand the causal mechanisms underlying complex disease49,70 and each quantum of new knowledge has the potential to provide an important insight that may have a dramatic impact on disease prevention or management. This implies that scientific interest may logically focus on any causal association that can convincingly be identified and replicated—it need not be ‘strong’ by any statistical or epidemiological criterion. The fundamental need, therefore, is for research platforms to support analyses powered to detect plausible aetiological effects. But, what does this mean? The majority of genetic effects on chronic diseases that have so far been identified and replicated are characterized8,9,13–32 by allelic or genotypic relative risks of 1.5 or less—many in the range 1.1–1.3. Effect sizes may be greater for causal variants than for markers in LD, but it would be unwise to assume that the gain will necessarily be substantial. Although the search for ‘low hanging fruit’ must continue, therefore, we agree with Easton et al.22
that much of the future harvest will be rather higher up the tree. But, even if they are of scientific interest, can ORs ≤1.5 reliably be detected by any
observational study? In 1995, Taubes argued that: ‘[observational epidemiological studies]… are so plagued with biases, uncertainties, and methodological weaknesses that they may be inherently incapable of accurately discerning … weak associations’.71
Fortunately, several of the central arguments underlying this bleak assessment do not hold in human genome epidemiology. Randomization at gamete formation renders simple phenotype–genotype associations robust to life style confounding and to uncertainty in the direction of causality—in other words, enhanced inferential rigour is a direct, but wholly fortuitous, consequence of what is often called Mendelian randomization.70,72–74
At the same time, the increasing accuracy and precision of measurements in genome epidemiology14,53,54
mean that—in the absence of intractable confounding and reverse causality—sufficient statistical power can realistically
be accrued to draw meaningful inferences for small effect sizes. Despite important caveats,70,73,75
therefore, small effects reflecting the direct impact of genetic determinants (main effects and gene–gene interactions) or the differential impact of genetic variants in diverse environmental backgrounds (gene–environment interactions) are more robust than their counterparts in traditional environmental epidemiology.
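To give a concrete sense of what ‘sufficient statistical power’ means for effects of this size, the conventional calculation sketched earlier can be inverted to yield an approximate required number of cases (with an equal number of controls) for 80% power at P < 0.0001. As before, this ignores assessment error, so real requirements will be larger still.

```python
# Approximate number of cases (with equal numbers of controls) needed for
# 80% power at P < 0.0001, for a binary 'at risk' genotype carried by 9.75%
# of the population (dominant model, MAF = 0.05). Conventional calculation,
# ignoring assessment error; illustrative sketch only.
import math
from scipy.stats import norm

def required_cases(p_exposed_controls, odds_ratio, alpha=1e-4, power=0.8):
    p0 = p_exposed_controls
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    var_unit = 1 / (p1 * (1 - p1)) + 1 / (p0 * (1 - p0))  # per case-control pair
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z ** 2 * var_unit / math.log(odds_ratio) ** 2)

for or_ in (1.5, 1.3, 1.2, 1.1):
    print(f"OR = {or_}: ~{required_cases(0.0975, or_):,} cases")
```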
Finally, we note that the primary simulations that underpin our conclusions are all based on a case : control ratio of 1 : 4, while a 1 : 1 ratio was adopted in considering the ‘conventional’ power calculations (see above). Furthermore, most of the case–control studies that we reference (including the WTCCC) are based on ratios that are much closer to unity.13–32
But this presentation was deliberate. Given access to a fixed number of cases and an unrestricted number of well-characterized controls, substantial additional power can be obtained using a design based on four or more controls per case (Supplementary Figure S1
). In the future, the existence of massive population-based biobanks such as UK Biobank48
and extensive sets of nationally representative controls (e.g. as in WTCCC14
) will mean that designs based on multiple controls per case will be highly cost effective and widely used. It would, therefore, have been inappropriate to present power calculations based primarily on the 1 : 1 design, as this would have increased the estimated sample size requirement and thereby strengthened our main message in a manner that could have been seen as misleading. On the other hand, when exploring the implications of conventional power calculations (see above), we focused on designs with approximately equal numbers of cases and controls, because most contemporary work is based on such designs and this was, therefore, felt to be more intuitive for readers. For the sake of completeness, Supplementary Figures S7
, S8a and S8b replicate the corresponding figures in the main text, but use equal numbers of cases and controls.
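The gain from recruiting extra controls when the number of cases is fixed follows directly from the variance of the log odds ratio, which contains one term per group. The sketch below uses illustrative exposure frequencies to show how that variance shrinks as the number of controls per case increases.

```python
# Relative efficiency of using m controls per case when the number of cases
# is fixed: only the control-side term of the log-OR variance shrinks as m
# grows. Exposure frequencies here are illustrative assumptions.
p_case, p_control = 0.13, 0.0975       # assumed exposure frequencies

def var_log_or(n_cases, m):
    """Variance of the log OR with n_cases cases and m controls per case."""
    return (1 / (n_cases * p_case * (1 - p_case)) +
            1 / (m * n_cases * p_control * (1 - p_control)))

base = var_log_or(4000, 1)
for m in (1, 2, 4, 8):
    print(f"{m} control(s) per case: variance ratio vs 1:1 = "
          f"{var_log_or(4000, m) / base:.2f}")
# The variance falls towards an asymptote (the case-side term), so most of
# the achievable gain is realized by about four controls per case.
```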
To finish, we note that the basic conclusions we have reached are stark and may appear disheartening. But, pessimism is unwarranted. Disentangling the causal architecture of chronic diseases will be neither cheap nor easy and it would be unwise to assume otherwise. But it has the potential to return investment many-fold with future improvements in promoting health and combating disease. Therefore, it is encouraging that several international case–control consortia have already managed to amass sample sizes of the magnitude that is realistically required.16,19,22,26 Furthermore, the largest contemporary cohort-based initiatives48,49,51,52 will generate enough cases to study the commonest diseases in their own right. To take things further, three complementary strategies will markedly enhance the capacity to study plausible relative risks right across the spectrum of complex diseases: (i) improve the accuracy and precision of measurements and assessments;14,53,54 (ii) increase the size of individual studies and biobanks8
and (iii) harmonize protocols for information collection, processing and sharing.10,46–49,70 Taken together, these actions will provide a powerful global research platform to drive forward our understanding of the causal architecture of the common chronic diseases. But, such a platform will be of little value unless the power calculations that underpin study design are both accurate and realistic. It is our hope that this article and access to ESPRESSO will be viewed as providing valuable guidance to those setting up individual biobanks and designing the case–control analyses to be based upon them.