Markers for individual genotyping can be selected using quantitative genotyping of pooled DNA. This strategy saves time and money.
To determine the efficacy of this approach, we investigated the bivariate distribution of association test statistics from pooled and individual genotypes. We used a sample of approximately 1,000 individuals with individual and pooled genotyping on 40,000 SNPs.
We found that the distribution of the joint test statistics can be modelled as a mixture of two bivariate normal distributions. One distribution has a correlation of zero, and is probably due to SNPs whose pooled genotyping was unsuccessful. The other distribution has a correlation of approximately 0.65 in our data. This latter distribution is probably accounted for by SNPs whose pooled genotyping accurately predicts the underlying allele frequency. Approximately 87% of the data belongs to this distribution. We also derived a method to investigate the effect of both the correlation and selection cut-off on the relative power of pooling studies. We demonstrate that pooled genotyping has good power to detect SNPs that are truly associated with disease-causing variants for SNPs showing good correlation between pooled and individual genotyping. Therefore, this approach is a cost effective tool for association studies.
DNA pooling has been used as a cost effective method for screening markers for more than 20 years. Proof of principle studies have been undertaken and have demonstrated the accuracy of the estimation of allele frequencies in pooled DNA on platforms that allow researchers to simultaneously analyse hundreds of thousands of markers [2,3,4,5,6]. The relative efficiency of different platforms has also been compared. Furthermore, promising results have been gained from two-stage case control studies of a number of different diseases, which have been undertaken using pooled DNA to screen for interesting markers and individual genotyping to confirm results [8,9,10,11,12,13,14,15,16,17]. It is known that measurement error from pooled studies is higher than measurement error in individual studies, and that differential hybridisation can make allele frequency estimation more difficult, but continuing efforts are being made to improve the accuracy of the data [6,18,19].
There have also been a number of papers that have attempted to establish the efficiency of pooled DNA analysis. The study of Barratt et al.  was one of the earliest: they used estimates of the different variances to approximate the effective sample size. Although important work is continuing into the nature of the variation underlying these studies, these parameters are platform dependent and will continue to change as platforms are modified. More recently, Prentice and Qi  have developed methods to consider the efficiency of pooling based on the ability to identify SNPs with a certain odds ratio. Pearson et al.  looked at the efficiency of pooling using simulation to deal with the difficulties of accounting for linkage disequilibrium (LD).
In this paper we investigate the relationship between the association test statistics derived from the quantitative genotypes from pools and individual genotypes. We fit a mixture of bivariate normal distributions to the joint statistics and determine the correlation and relative proportions of these distributions. We go on to investigate the relative power of pooling studies to correctly identify markers that show association in individual genotyping studies. In addition, we investigate the effect of the cut-off used for marker selection, the correlation between the association statistics from the two types of study, and the p value of the marker in the individual genotyping study.
Approximately 2.4 million SNPs were genotyped in pools by Perlegen Sciences using custom Affymetrix chips. The SNPs were divided into 49 chips with about 50,000 SNPs on each chip. The sample of 482 cases and 466 controls was split into 8 case and 8 control pools containing about 60 people each. Details of pool construction, pooled allele frequency estimation, association tests of the pooled data, SNP selection, and genotyping methodology for the individual genotypes are outlined in the paper by Bierut et al.; hence only a summary of this information is provided here.
DNA concentration of each individual was normalised and equimolar amounts were included in the pools.
Allele frequencies were determined from the ratio of the intensity of the reference strand to the sum of intensities from both the reference and alternate strand. Trimmed means from perfect-match (PM) and mis-match (MM) intensities were both utilised. The three quality control measures used were designed to ensure that (i) the PM had a higher intensity signal than the MM, (ii) the signal to noise ratio was acceptably low, and (iii) no features were saturated.
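The intensity ratio described above can be sketched in a few lines. This is a minimal illustration and not the Perlegen pipeline: the function names and the 10% trimming fraction are assumptions, and the way PM and MM intensities were combined is not specified here, so only a single set of reference and alternate intensities is used.

```python
def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values.

    The trimming fraction is an assumed, illustrative default (use trim < 0.5).
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


def pooled_allele_frequency(ref_intensities, alt_intensities, trim=0.1):
    """Estimate the reference allele frequency as ref / (ref + alt)."""
    ref = trimmed_mean(ref_intensities, trim)
    alt = trimmed_mean(alt_intensities, trim)
    return ref / (ref + alt)
```

With equal hybridisation of the two alleles, a pool in which the reference signal is three times the alternate signal would give an estimated frequency of 0.75.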
Corrected t tests were used to evaluate association. This method involved correcting the standard error by a chip-specific additive constant to ensure there was not biased selection of SNPs with small standard errors. Empirical p values were calculated using the rank of the corrected t test value divided by the number of successful markers on the chip.
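A rough sketch of the corrected test and the rank-based p values follows. Several details are assumptions made only for illustration: the standard error is computed in the Welch (unequal variance) form from the per-pool frequency estimates, markers are ranked by the absolute value of the statistic, ties are broken arbitrarily, and the name `se_constant` stands in for the chip-specific additive constant.

```python
from math import sqrt
from statistics import mean, variance


def corrected_t(case_freqs, control_freqs, se_constant):
    """t-like statistic with an additive constant in the standard error.

    case_freqs / control_freqs: per-pool allele frequency estimates.
    se_constant: hypothetical chip-specific constant added to the SE.
    """
    diff = mean(case_freqs) - mean(control_freqs)
    se = sqrt(variance(case_freqs) / len(case_freqs)
              + variance(control_freqs) / len(control_freqs))
    return diff / (se + se_constant)


def empirical_p_values(t_values):
    """Rank of |t| (largest first) divided by the number of markers."""
    n = len(t_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    p_values = [0.0] * n
    for rank, i in enumerate(order, start=1):
        p_values[i] = rank / n
    return p_values
```

Adding the constant shrinks the statistic most for SNPs whose estimated standard error is very small, which is the behaviour the correction is meant to achieve.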
Only those SNPs in which at least two pools of both the cases and controls had allele frequency estimates that met the required quality control standard were considered for selection. The main selection criterion was a p value below 0.05; however, some SNPs were selected for population stratification and quality control. In total, 41,402 SNPs were individually genotyped: 39,213 of these were selected from the pools and 2,189 were selected for QC and population stratification. After QC we were left with 29,314 SNPs.
Genotyping was performed with a clustering technique which was a combination of K-means and multiple linear regression. There was a stringent quality control procedure for the individually genotyped data. The call rate had to be above 80%, HWE p values from both cases and controls were considered, and extra checks relating to the X and Y chromosomes were undertaken. Furthermore, a metric was derived from 15 inputs relating to SNP quality and genotype, which correlated with the probability of having a discordant call (this was calculated using non-Perlegen HapMap project genotypes). Monomorphic SNPs were also excluded from the analysis. Allele-wise analysis was carried out to derive p values for association.
We examined the relationship between the association test statistics rather than the allele frequencies. While some investigators prefer to study the allele frequency, we believe the association test statistics are more relevant for determining the efficacy of using pooled versus individual genotyping for an association study. That is, if a SNP would have been significant under individual genotyping, would it meet the pooling threshold for individual genotyping and thus be detected in a pooling design? This is the focus of our analyses.
For each p value resulting from the tests carried out on both the quantitative genotypes derived from the pools and the individual genotypes, we determined the value of the standard normal deviate for a two-sided test (X pool and X igt). We ensured that the direction of deviation was comparable for the pooled and individual genotyping.
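The conversion from a two-sided p value to a signed standard normal deviate can be sketched as below. The function name and the `effect_sign` argument are illustrative assumptions; in practice the sign might be taken from the case minus control frequency difference for the same allele in both analyses.

```python
from statistics import NormalDist


def signed_deviate(p_value, effect_sign):
    """Standard normal deviate for a two-sided p value, carrying a direction.

    p_value must lie in (0, 1]; effect_sign is assumed to be +1 or -1 and to
    refer to the same allele in the pooled and individual analyses.
    """
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return z if effect_sign >= 0 else -z
```

Using the same allele to define the sign in both analyses is what keeps the direction of deviation comparable between X pool and X igt.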
After selecting only those SNPs that had been genotyped using both pooled DNA and individual DNA, we fitted a mixture of bivariate normal distributions to the distribution of the joint X statistics. One of the bivariate normal distributions, with mixing proportion α, was fixed to have a correlation of zero because it would comprise data from ‘bad’ SNPs, i.e. SNPs with allele frequencies that cannot be accurately estimated from pools.
For the proportion (1 – α) of ‘good’ SNPs we assumed that the joint distribution (X pool, X igt) had a bivariate normal distribution, with X pool and X igt having mean 0, variance 1 and correlation r between them. (‘Good’ SNPs are those with allele frequencies that can be accurately estimated from pools.) If all SNPs were typed both ways, the joint density would be the standard bivariate normal function ϕ (r), and r could be computed in the standard way.
However, the statistics are only available from SNPs selected for individual genotyping (X pool, X igt | X pool ∈ S), where S is the set of SNPs chosen for individual genotyping. The density for this distribution is K ϕ (r), where the constant K does not depend on the correlation r and is determined by the selected area under the marginal density of X pool (it is the reciprocal of that area).
The final likelihood is given by:
(1 − α) Kϕ(r) + α Kϕ(r = 0).
The parameters r and α may then be estimated by maximizing the likelihood over the observations.
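The maximisation can be illustrated with a small sketch. Because K does not depend on r or α, it is dropped from the log-likelihood here. The grid search, the function names and the example pairs are all assumptions made for illustration (the pairs are made up and are not data from this study); a numerical optimiser could be used instead.

```python
from math import exp, log, pi, sqrt


def bvn_density(x, y, r):
    """Standard bivariate normal density with correlation r."""
    one_minus_r2 = 1.0 - r * r
    quad = (x * x - 2.0 * r * x * y + y * y) / (2.0 * one_minus_r2)
    return exp(-quad) / (2.0 * pi * sqrt(one_minus_r2))


def mixture_loglik(pairs, r, alpha):
    """Log-likelihood of (X pool, X igt) pairs, omitting the constant log K."""
    return sum(
        log((1.0 - alpha) * bvn_density(x, y, r) + alpha * bvn_density(x, y, 0.0))
        for x, y in pairs
    )


def fit_grid(pairs, r_grid, alpha_grid):
    """Return the (r, alpha) grid point with the highest log-likelihood."""
    return max(
        ((r, a) for r in r_grid for a in alpha_grid),
        key=lambda ra: mixture_loglik(pairs, ra[0], ra[1]),
    )


# Illustrative selected pairs only (made up, not data from the study).
pairs = [(2.0, 1.4), (2.5, 1.6), (-2.2, -1.5), (3.0, 2.0), (-2.8, -1.7), (2.1, 1.2)]
r_grid = [i / 20 for i in range(20)]       # 0.00, 0.05, ..., 0.95
alpha_grid = [j / 50 for j in range(26)]   # 0.00, 0.02, ..., 0.50
r_hat, alpha_hat = fit_grid(pairs, r_grid, alpha_grid)
```

A grid search is used only because it is easy to read; with tens of thousands of SNPs a gradient-based or EM-style fit would be the more practical choice.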
Established characteristics of both standard and bivariate normal distributions can be used to determine the ability to identify markers that would demonstrate association in an individual genotyping study from markers genotyped in pools. Essentially, we calculated the probability of a marker being selected for the individual genotyping study given its expected standard deviate in that study (X igt), as well as the cut-off used for marker selection (t) from the pooled study and the correlation between the distributions of the test statistics from the pooled and individual genotyping results (r).
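We derived the standardised value of the cut-off t given both X igt and the correlation r of the distributions: conditional on X igt, X pool is normally distributed with mean r X igt and variance 1 − r², so the standardised cut-off is (t − r X igt)/√(1 − r²). We then determined the probability of a standard normal deviate exceeding this value, which is the probability that X pool exceeds t.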
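The calculation can be sketched as follows, using the fact that for a standard bivariate normal pair, X pool given X igt is normal with mean r X igt and variance 1 − r². The function name and arguments are illustrative, and the treatment of the pooled cut-off as a two-sided p value threshold (so that both tails count as selected) is an assumption.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def selection_probability(p_igt, r, pool_cutoff_p=0.05):
    """Probability that a 'good' SNP passes the pooled cut-off.

    p_igt: expected two-sided p value in the individual genotyping study.
    r: correlation between the pooled and individual test statistics.
    pool_cutoff_p: two-sided p value threshold applied to the pooled test
                   (assumed two-sided for this sketch).
    """
    x_igt = -STD_NORMAL.inv_cdf(p_igt / 2.0)       # expected deviate X igt
    t = -STD_NORMAL.inv_cdf(pool_cutoff_p / 2.0)   # cut-off on the deviate scale
    sd = sqrt(1.0 - r * r)                         # conditional SD of X pool
    upper = 1.0 - STD_NORMAL.cdf((t - r * x_igt) / sd)
    lower = STD_NORMAL.cdf((-t - r * x_igt) / sd)  # opposite tail, usually tiny
    return upper + lower
```

Under these assumptions, `selection_probability(1e-8, 0.65)` comes out at roughly 0.99, and with r = 0 the function returns the cut-off itself (0.05), since the pooled statistic then carries no information about the individual one.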
In the study by Bierut et al.  the correlation of the allele frequency estimates had been calculated as 0.87 for the cases and 0.84 for the controls. The distribution of X pool was approximately normal, as expected. It had a mean of 0.019 and a standard deviation of 1.000.
We demonstrated that the bivariate distribution of the joint test statistics from the case-control tests of pooled and individual data can be modelled as a mixture of normal distributions with proportions α and 1 – α (n = 29,314). In our data the proportion α with correlation zero equalled 0.133 and the correlation of the 1 – α proportion of the distribution was estimated to be 0.65 (± 0.0022). We propose that some markers are not typed successfully in pools and that the results from these comprise the 13% of the data that has no correlation. Such markers pass QC of the pooled data but do not accurately reflect the true allele frequency. This could be due to a number of problems, such as extreme differential hybridisation or non-specificity of the probe. Differences in allele frequency estimates could be used to ascertain the identity of ‘bad’ SNPs; however, such information is platform specific, with limited relevance to the wider community, and hence is not presented here.
Figure 1a shows a simulated mixture of bivariate normal distributions at correlation 0 and 0.65 with proportions 0.13 and 0.87. This demonstrates the pattern we would expect to observe when we plot our data. Figure 1b is the plot of our data; although the data points from the two distributions are not distinguishable, a similar pattern is seen.
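A simulation of this kind can be sketched as below. This is not the code used to produce the figure; the function name, the seed and the sampling scheme (drawing the second statistic conditionally on the first) are assumptions made for illustration.

```python
import random
from math import sqrt


def simulate_mixture(n, r=0.65, alpha=0.13, seed=0):
    """Draw n pairs from a mixture of two standard bivariate normals.

    With probability alpha a pair has correlation 0; otherwise it has
    correlation r. Returns a list of (x_pool, x_igt) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        rho = 0.0 if rng.random() < alpha else r
        # y has unit variance and correlation rho with x
        y = rho * x + sqrt(1.0 - rho * rho) * e
        pairs.append((x, y))
    return pairs
```

The overall correlation of such a mixture is (1 − α) r, about 0.57 for the values above, which is why the uncorrelated component is hard to pick out by eye in a scatter plot.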
We also analysed data on which a stricter criterion for selection was used, i.e. markers in which at least 7 of both the case and control pools passed QC. Here α decreased to 0.077 and the correlation of the other distribution increased to 0.67 (± 0.0021) (n = 24,879). This suggests that more stringent use of quality control statistics could increase the overall accuracy of test statistics from the analysis of pooled genotypes. From Figure 2 we can see that 7 or more pools pass QC in the vast majority of markers and that over 90% of markers pass QC in all pools.
We also looked at the correlation in the 286 stratification SNPs in which 7 or more of both the case and control pools had passed QC. The correlation was 0.72 (± 0.023). (There was no evidence of stratification in these samples.)
When looking at the proportion of markers we expect to successfully select for individual genotyping, the patterns we would expect to find are present:
These statements are all illustrated in Figure 3, where the proportion of markers selected is plotted against −log10 of the expected p value of a causative SNP. Different data series represent three different correlations and two different cut-offs. For simplicity, the proportion of SNPs with an uncorrelated bivariate distribution is not included in the figure, but it would have an impact on overall selection.
Although these statements are self-evident, our method is a useful tool for researchers to derive actual estimates of relative power for pooling studies.
We demonstrate that the bivariate distribution of the joint test statistics from the case-control tests of pooled and individual genotypes can be modelled as a mixture of normal distributions. We also demonstrate that for ‘good’ SNPs pooling has essentially the same power to detect SNPs that are truly associated with disease-causing variants and is therefore a cost effective tool for association studies. If, for example, we required a p value of 10⁻⁸ to deem a SNP significant genome-wide in a GWAS, such a SNP would meet the 0.05 cut-off for individual genotyping in the pooled analysis over 99% of the time. Accordingly, for a given power P (determined by effect size, sample size and analysis used), the power would be 0.99P if the same sample size and SNP density are used. ‘Bad’ SNPs are unlikely to be identified using pooling approaches, but with high SNP density their signals can be picked up by markers in high LD.
The importance of good quality control for pooling studies is also evident as it increases the correlation of the test statistic. Furthermore, we provide a statistical method that will allow researchers to evaluate a suitable cut-off for selection, taking into account information specific to their laboratory set-up and the aims of their study. Alongside the work of others who have investigated different aspects of pooling methodologies, our research demonstrates the usefulness of this technique.
Our sample was particularly useful for the purposes of the study. We had a large amount of good quality data. More than 40,000 markers had been genotyped in pools and individually; of these, about 30,000 met the high QC criteria we demanded. Because 295 SNPs were analysed before the pooling stage to assess population stratification, we could also be confident of the quality of each individual sample. Although we could have used the individual genotyping data to estimate differential hybridisation (k) for each SNP, we did not. Our aim was to investigate the power of pooling in circumstances where k would be unknown, for example on most custom built chips used to follow up linkage regions.
Despite the proven utility of pooling, it does provide less information, hence less power, than individual genotyping. As a consequence, the usual motivation to pool is the reduction in cost of the overall experiment, although accessibility of certain equipment can also play a role. There are massive set-up costs for the technologies that genotype hundreds of thousands of markers at one time; many groups do not have such equipment and would need to outsource samples for this kind of analysis. Although extra work is required to ensure the exact consistency of the concentration of all DNA in a pooling study, it can be undertaken on machinery that is cheaper and more commonly available. This may make it more feasible for a laboratory to pool DNA and outsource the pools than outsource all their samples individually.
Cost has been evaluated in a number of studies but due to the continual changes in both the available technology and its price, cost is a constantly changing variable. Furthermore, researchers have to consider whether any given saving outweighs the disadvantages of pooling and this will depend on many factors.
In determining the best approach one also has to consider what the goals and possibilities of the screening procedure are, bearing in mind the reduced relative power of a pooling study in relation to an individual genotyping study. The number of markers being investigated, along with any knowledge of their expected effect size, should impact the decision making process. If a researcher has a small sample size but would like to undertake a genome wide association study (GWAS), and if pooling is the only way they can afford to do it, they would need to consider the implication on power very carefully. Placed alongside the necessity of correction for multiple testing, pooling may reduce power to a level where any money spent on a GWAS is a waste, making it more prudent to use individual genotyping to undertake an extensive candidate gene study. If a GWAS is being performed on a large sample like those being gathered by the various biobanks, pooled DNA analysis would still have considerable power. The number of markers needing to be followed up could alter the method and subsequent cost of the individual genotyping study. For example, the current costs of Illumina technology mean the cost of a custom chip with 4000 markers is the same as the cost of the 300K chip, hence pooling is only cost effective when significantly fewer markers are followed up. If pooling is to be undertaken, there is a lot of information available on how to do it most efficiently, including a review by Sham et al.  and a recent paper by MacGregor . The latter looks at issues of variability, suggesting that although multiple pools are useful, there are diminishing returns when one gets to a certain point.
The main disadvantages of pooling are the increase in measurement error and a decrease in the ability to undertake secondary analysis or control for multiple covariates. In a GWAS – where hundreds of thousands of markers are typed – the increase in measurement error would not be as much of a problem as in the context of a tag SNP study. The former would result in considerable redundancy of information from markers in complete or very high LD. This redundancy is also useful when one considers the need for strict quality control. Furthermore, new methods are still developing to help keep the measurement error to a minimum in all studies.
If pooling studies are designed carefully, some level of secondary analysis can be undertaken. For example, one can construct pools based on covariates and incorporate this information into the analysis using meta-regression techniques . This could be important because of the documented effect of a certain covariate or because the DNA has come from different sources (buccal vs. blood). However, this approach is limited by the number of pools it is feasible to create, as well as by the nature of the covariate. Categorical variables are easy to work with but quantitative ones are not, and something like admixture, which may require assessment by individual genotype data, would be impossible to use without extensive preliminary analysis. Genotype and haplotype analysis can only be undertaken in very small pools, reducing the cost-effectiveness and perhaps the usefulness of the method . Depending on the suspected genetic architecture of the trait, such secondary analysis may not be considered important.
In summary, we suggest that pooling studies are a useful screening technique in some situations, and we provide an approach for researchers to investigate their power in this context.
JK was supported by an MRC Bioinformatics Advanced Training Fellowship, Grant No. G0501329.
The NICSNP project is a collaborative research group and part of the NIDA Genetics Consortium. Subject collection was supported by NIH grants CA89392 (PI-L Bierut) from the National Cancer Institute and DA012854 (PI-P Madden) from the National Institute on Drug Abuse. Genotyping work at Perlegen Sciences was performed under NIDA Contract HHSN271200477471C. Phenotypic and genotypic data are stored in the NIDA Center for Genetic Studies (NCGS) at http://nidagenetics.org/ under NIDA Contract HHSN271200477471C (PIs J Tischfield and J Rice).
JP Rice, DG Ballinger and SF Saccone are listed as inventors on a patent (US 20070258898) held by Perlegen Sciences, Inc., covering the use of certain SNPs in determining the diagnosis, prognosis, and treatment of addiction.
We would like to thank Laura Bierut, Nick Martin and Stuart MacGregor who assisted in the preparation of this manuscript.