All statistical analysis models for pooled case-control data are built on the probability distribution of allele frequency
which depends upon depth (
), chromosome number per pool (s
) and treatment group. Different assumptions about the distribution of
lead to different statistical tests. The binomial model used in Fisher’s exact test assumes
is constant across pools within the same treatment group [see Equation (2)
in the Supplementary Material
], while the extra-binomial variation models used in this article assume
is a continuous variable that varies from pool to pool.
Note that allele frequencies estimated by sequencing pools of PCR-amplified DNA may be biased, as the PCR may preferentially amplify one allele compared to the other (Sham et al., 2002
). This means that allele frequencies presented in 1 and may deviate from the true population frequency, which can be an issue when considering the sample size required to replicate findings for rare SNPs, in particular. As this bias is likely to affect cases and controls equally when there is no underlying difference in allele frequency, it cannot be expected to inflate Type 1 error, although it may act to amplify or depress any true difference between cases and controls depending on which allele is subject to bias.
Other methods allowing for the variance in pooled sequencing have been explored. Wang et al. (2010)
estimated allele frequencies by weighting sequencing read counts using a weight inversely proportional to the variance of the allele frequency estimate and compared these weighted estimates between cases and controls. However, no published software is available to implement this approach. An alternative approach adopted by Kim et al. (2010)
is to estimate allele frequency by maximum likelihood and then use a likelihood ratio test to test the MAF difference between cases and controls. However, this method is affected heavily by pool size. As the authors pointed out, their approach is not applicable to pools with more than five individuals per pool. Given the main purpose of pooled sequencing is to decrease the experimental cost, a large pool size may be expected in real experimental data.
Here, instead of specifying the statistical distribution of
, we propose a ‘quasi-likelihood’ approach which only defines the mean and variance of
. The two EB models in this article share the same mean functions of
with different variance functions where Williams’ EB1 model adopted one parameter,
, for each SNP i
and our EB2 model adopted two universal parameters, a
, for all SNPs.
The EB models have the advantage of including between-pool variation, compared to the standard binomial model which ignores heterogeneity of pools and thus provides a poor fit of our sequencing data (). However, several shortcomings are exposed in EB1: (i) the significance test is based on increases in the
statistic, not a statistically powerful approach (McCullagh, 1983
); (ii) parameter estimation is complicated requiring iteratively re-weighted least squares (see Section C of the Supplementary Material
) and (iii) the small sample size (only 20 data points for each SNP i
cannot be estimated precisely.
In EB2, we assumed that the parameters in the variance function are constant across all SNPs tested. By doing this, we expanded the number of data points for parameter estimation which allowed more efficient estimation and our use of GLM means it may be fitted in standard software. This could, of course, be extended to include other SNP-specific features by changing the form of Equation (8)
, but no other features were found to be predictive of error structure in our 454 dataset.
The improvement of EB2 over EB1 is obvious in the Q–Q plot—the slope declines to 1.26 ( versus a). Given the fact that of the 10 re-sequenced genes, 6 were in known T1D regions, we expect some SNPs would be genuinely associated with T1D and therefore expect the slope to be greater than one.
The simulation results showed that across different situations, our preferred EB2 model has an overall excellent control of Type 1 error rate. This is in contrast to Fisher’s exact test, where the observed Type 1 error rate may be as much as 9-fold the nominal rate at
In our 454 data, base error rates were very low. However, we included a wider range of errors in our simulations to allow us to examine the potential for our model to be applied to other technologies. Because our model only deals with increased variance, and not systematic bias induced by base calling errors, we undertake a separate error correction step which should correct the error in expectation, but may itself increase variance. This can be seen in the dependence of Type 1 error rate on the sequencing error for Fisher’s exact test, whereas our model appears to have a good control of Type 1 error rates across the range of sequencing errors considered, despite both methods being applied to the same, error-corrected, data. Note that any estimates of error rates are likely to be uncertain. Our experience is that the effect of correcting for a poorly estimated error rate in simulated data depends on MAF, the magnitude of the true error and how poorly it is estimated. In the case of our 454 data, we are lucky to be dealing with low estimated error rates which are consistent with previous estimates for this technology. Correction for these estimated error rates made no substantial difference to our conclusions. However, if the estimated error rates in other applications were larger, it might be wise to conduct some sensitivity analysis, varying the estimated error rate by factors of two or more, to examine whether associations identified are robust to misspecification of error rates.
A shortage of any model used here is the assumption of binomial error within each pool, which allows a zero count for major alleles. This is in contrast to our SNP detection principle which requires at least two supporting reads at a position to call a SNP (i.e. we should model
for some k
). When we removed our detection criteria and instead examined all SNPs in dbSNPs version 128 (Sherry et al., 2001
) within our 144 target regions, the Q–Q plot showed further improvement (b). Additionally, a much smaller maximum sample quantile was observed in dbSNP Q–Q plot, indicating there are still errors among the full set of 473 called SNPs.
Overall, extra-binomial models appear to have better properties than the naive binomial model. They could analyse a larger range of variants with lower or more variant pool depths and our new EB2 model is more appropriate and easier to apply compared with the EB1 model proposed by Williams (1982)
. Work such as this is used as the basis for further confirmatory experiments and we intend to follow up the four new SNPs identified in . More accurate results lead to better targeting of these experiments and thus faster and more efficient progress to identify the causal genes in T1D.