We first assume a cost regime in which the per-subject overhead cost is negligible compared with the cost of the sequenced bases. Under a fixed total cost, the affordable total number of bases ($T$) is also fixed, so the genome-wide average depth of coverage $\lambda$ is determined by the number of cases to be sequenced ($N_1$): $\lambda = T/N_1$. For a rare variant, we assume that homozygous carriers are rare enough to be ignored. The power to correctly call a heterozygous carrier from sequencing data is a function of $\lambda$, $R(\lambda)$, which is usually approximately sigmoid in shape. It can be determined from empirical data (Supplementary Fig. S1) or approximated analytically (Wendl and Wilson, 2009; Wheeler et al., 2008).
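As a rough illustration of the relation $\lambda = T/N_1$ and of one possible analytic form for $R(\lambda)$, the sketch below uses a toy calling rule that is an assumption made here, not the empirical curve of Supplementary Fig. S1: read depth is taken to be Poisson, each read is assumed to sample the variant allele with probability 1/2, and a heterozygote is called when at least `min_alt_reads` variant reads are seen. The function names and the threshold of 2 reads are hypothetical.

```python
import math


def mean_depth(total_genomes_x: float, n_cases: int) -> float:
    """Average depth per case, lambda = T / N1, with T in units of 1x genomes."""
    return total_genomes_x / n_cases


def het_call_power(depth: float, min_alt_reads: int = 2) -> float:
    """Toy R(lambda): probability of seeing at least min_alt_reads variant reads.

    Assumption (not from the source): depth is Poisson and each read carries
    the variant allele with probability 1/2, so the variant-read count is
    Poisson with mean depth / 2.
    """
    mu = depth / 2.0
    # Probability of observing fewer than min_alt_reads variant reads.
    miss = sum(math.exp(-mu) * mu**j / math.factorial(j)
               for j in range(min_alt_reads))
    return 1.0 - miss


if __name__ == "__main__":
    total = 1000.0  # T = 1000x genomes, as in the example resources
    for n_cases in (25, 50, 100, 200, 400):
        lam = mean_depth(total, n_cases)
        print(f"N1={n_cases:4d}  lambda={lam:6.2f}  R={het_call_power(lam):.4f}")
```

Under this toy rule, $R(\lambda)$ stays close to 1 at high depth and falls off quickly once the depth drops to a few reads per site, which is the qualitative behaviour the design trade-off depends on.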
In the presence of allelic heterogeneity, we assume there are $M$ rare causal variants within a gene (or pathway). These alleles are independently distributed in the population, with respective carrier frequencies $h_1, \ldots, h_M$ among cases. The carrier status for any of these rare variants is a Bernoulli variable with compound carrier frequency $p = R(\lambda)\left(1 - \prod_{i=1}^{M}(1 - h_i)\right)$ in cases. Let $F_1 \equiv \sum_{i=1}^{M} h_i$; then $p \approx F_1 R(\lambda)$ when $h_i \ll 1$, and the number of observed carriers $K_1$ among $N_1$ cases follows a binomial distribution with parameters ($N_1$, $F_1 R(\lambda)$). An economical design is to sequence cases only and use publicly available data for an additional
$N_0$ samples as controls. For example, we use $N_0 = 400$, which represents a lower bound on the sample size from one major population of the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2010). Given $T$, $N_0$, $F_1$ and $R(\lambda)$, we provide an online tool (OPERA) that calculates the power to detect association for different values of $N_1$ (and $\lambda$) under a given Type I error cutoff ($Q$). As an example, we show the results for resources of $T = 1000\times$ genomes under the simplistic assumption that the number of carriers in controls ($K_0$) is 0. The power curve takes the unusual shape of a sawtooth wave, which reflects discreteness artifacts (Supplementary Material) of the test statistics for association of rare variants (Supplementary Figs S2 and S3), superimposed on a smooth concave curve through the respective tips of the sawteeth. Moving along the
$x$-axis, the smoothed curve initially increases with $N_1$ when $N_1$ is relatively small, reflecting the fact that when $N_1$ is small, $\lambda$ is large, and therefore $R(\lambda)$ is close to 1 and changes very little as $N_1$ increases. For larger sample sizes and lower coverage, where $R(\lambda)$ is well below 1, increasing $N_1$ further reduces $\lambda$ and $R(\lambda)$ enough to negate the benefit of increasing $N_1$; thus power starts to decrease from one sawtooth tip to the next. We present similar results with
$K_0 = 1$ (Supplementary Fig. S4a), which can represent singleton variants in controls sequenced at low coverage. In general, for a gene with compound carrier frequency $F_0$ of rare functional variants in controls, the expected power is the sum of the power under each possible $K_0$ value, weighted by the probability of observing that $K_0$. We show the power curve for a hypothetical $F_0 = 0.001$ (Supplementary Fig. S4b).
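The power calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the OPERA implementation: the association test is assumed here to be a one-sided Fisher's exact test on carrier counts, $K_0$ is assumed to follow a Binomial($N_0$, $F_0$) distribution when averaging over controls, and the values of $F_1$ and $Q$ in the demo, the toy $R(\lambda)$, and all function names are hypothetical.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(K = k) for K ~ Binomial(n, p), computed in log space."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    return math.exp(log_comb(n, k) + k * math.log(p)
                    + (n - k) * math.log1p(-p))


def fisher_one_sided(k1: int, n1: int, k0: int, n0: int) -> float:
    """One-sided Fisher p-value: P(cases hold >= k1 of the k1 + k0 carriers)."""
    total = k1 + k0
    log_den = log_comb(n1 + n0, total)
    p_value = 0.0
    for x in range(k1, min(total, n1) + 1):
        p_value += math.exp(log_comb(n1, x) + log_comb(n0, total - x) - log_den)
    return min(p_value, 1.0)


def power_given_k0(n1: int, n0: int, k0: int, f1: float, r: float, q: float) -> float:
    """P(test rejects at level q) when K1 ~ Binomial(n1, f1 * r) and K0 = k0."""
    p = f1 * r
    return sum(binom_pmf(k1, n1, p) for k1 in range(n1 + 1)
               if fisher_one_sided(k1, n1, k0, n0) <= q)


def expected_power(n1: int, n0: int, f0: float, f1: float, r: float, q: float,
                   tail: float = 1e-9) -> float:
    """Power averaged over K0 ~ Binomial(n0, f0) (assumed), truncating the tail."""
    total, cumulative = 0.0, 0.0
    for k0 in range(n0 + 1):
        weight = binom_pmf(k0, n0, f0)
        total += weight * power_given_k0(n1, n0, k0, f1, r, q)
        cumulative += weight
        if cumulative >= 1.0 - tail:
            break
    return total


def power_curve(t, n0, f1, q, r_of_lambda, n1_values, k0=0):
    """Power for each N1, with depth lambda = T / N1 feeding R(lambda)."""
    return [(n1, power_given_k0(n1, n0, k0, f1, r_of_lambda(t / n1), q))
            for n1 in n1_values]


if __name__ == "__main__":
    # Toy R(lambda): at least 2 variant reads out of Poisson(lambda / 2).
    toy_r = lambda lam: 1.0 - math.exp(-lam / 2.0) * (1.0 + lam / 2.0)
    # F1 = 0.05 and Q = 0.001 are hypothetical values chosen for illustration.
    for n1, pw in power_curve(1000.0, 400, 0.05, 1e-3, toy_r, range(20, 201, 20)):
        print(f"N1={n1:4d}  power={pw:.3f}")
```

Because the rejection region is defined by integer carrier counts, the smallest significant $K_1$ changes in steps as $N_1$ grows, which is one way to see where the sawtooth shape comes from.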