
Bioinformatics. 2013 January; 29(1): 29–38.

Published online 2012 October 27. doi: 10.1093/bioinformatics/bts645

PMCID: PMC3530907

Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA

*To whom correspondence should be addressed.

Associate Editor: Michael Brudno

Received 2012 April 4; Revised 2012 August 27; Accepted 2012 October 24.

Copyright © The Author 2012. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com


**Motivation:** For many complex traits/diseases, it is believed that rare variants account for some of the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost-effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers, which is important for downstream applications of association analysis of rare variants. However, existing algorithms for analysing sequence data from overlapping pools are limited.

**Results:** We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation and variant sample decoding. The framework can be used in combination with any design matrix. We have investigated its performance based on two different overlapping designs and have compared it with three state-of-the-art methods, by simulating targeted sequencing and by pooling real sequence data. Results on both datasets show that our algorithm has made significant improvements over existing ones. In conclusion, successful discovery of rare variants and identification of variant carriers using overlapping pool strategies critically depend on many steps, from generation of design matrixes to decoding algorithms. The proposed framework in combination with the design matrixes generated based on the Chinese remainder theorem achieves best overall results.

**Availability:** Source code of the program, termed VIP for Variant Identification by Pooling, is available at http://cbc.case.edu/VIP.

**Contact:**
jingli@cwru.edu

**Supplementary information:**
Supplementary data are available at *Bioinformatics* online.

In the past few years, genome-wide association studies have successfully identified many genetic variants associated with complex human diseases and traits (McCarthy *et al.*, 2008; Wellcome Trust Case Control Consortium, 2007). However, many of the variants identified by genome-wide association studies can explain only a small proportion of inherited risk (Manolio *et al.*, 2009). One possible explanation is that genetic variants with low minor allele frequencies (MAFs) and rare variants contribute a substantial fraction of the missing heritability (Altshuler *et al.*, 2010). Although common genetic variations such as single nucleotide polymorphisms (SNPs) have been well studied and characterized through international collaborative efforts [e.g. the HapMap project (Altshuler *et al.*, 2005; Frazer *et al.*, 2007)], the investigation of low-MAF or rare SNPs is much harder. First of all, large sample sizes are required to discover and type rare SNPs. Furthermore, before a rare SNP has been discovered, array-based technologies cannot be used to type individual genotypes from samples. Only very recently, with the development of next-generation sequencing technologies, has the international community been able to systematically survey variants with low MAFs using a large number of population samples through collaborations (Thousand Genomes Project Consortium, 2010). Results from the 1000 Genomes Project (Thousand Genomes Project Consortium, 2010) on variant locations, types and allele frequencies lay a solid foundation for studying the relationships between genotypes and phenotypes. However, the cost of a project at this scale (e.g. sequencing >2000 samples with deep coverage) is still prohibitively high for most individual investigators, even though sequencing costs have been continuously decreasing. On the other hand, for case–control studies of complex diseases and traits, thousands of samples are normally required for each study to achieve reasonable power.

An effective strategy to reduce the overall cost is to pool DNA from different individuals together and then to sequence the pooled DNA with high coverage. This not only capitalizes on the continually decreasing sequencing costs per se but also can effectively reduce the costs associated with DNA preparation and target capturing for targeted sequencing projects. The cost of target capturing is proportional to the number of samples (i.e. the number of individuals without pooling versus the number of pools with pooling) and has remained stable over the years, so pooling strategies can save tremendously on sample preparation. Furthermore, for targeted sequencing projects, depending on the size of the targeted regions and the coverage, researchers may not be able to fully exploit instruments’ capacity when sequencing one individual at a time, even with multiplexing using barcodes. Inspired by this strategy, both wet-lab experiments using pooling (Calvo *et al.*, 2010; Momozawa *et al.*, 2011; Nejentsev *et al.*, 2009; Out *et al.*, 2009) and methods supporting pooling (Bansal *et al.*, 2010; Druley *et al.*, 2009; Kim *et al.*, 2010; Lee *et al.*, 2011; Rivas *et al.*, 2011; Vallania *et al.*, 2010; Wang *et al.*, 2010; Wei *et al.*, 2011) have emerged recently. Among them, Syzygy (Calvo *et al.*, 2010; Rivas *et al.*, 2011; Rivas and Daly, 2011) has been used in several real sequencing projects that adopted the pooling strategy. However, the main limitation of the naive pooling strategy is its inability to detect variant carriers, which is of high importance for disease-association studies of rare variants. Multiplexing using barcodes can partially solve the problem (Nijman *et al.*, 2010; Smith *et al.*, 2010), but the cost is still high because of the limited barcoding capacity per run and the per-sample cost of DNA preparation and target capturing.

A promising alternative, the so-called overlapping pool design, has been proposed by several groups (Erlich *et al.*, 2009; Prabhu and Pe’er, 2009). The basic idea is rooted in combinatorial designs. By allocating each individual to a small number of different pools, it is possible to identify samples that carry variants based on their pool signatures. Overlapping pool designs represent a very important and economically feasible approach to discover rare SNPs and to identify rare mutant carriers. By doing so, one can avoid the two-step approach in which rare variant loci are first discovered by sequencing and then genotyped using customized arrays. However, studies of this important strategy have been limited so far. There are mainly two existing algorithms: DNA Sudoku (Erlich *et al.*, 2009) and the logarithm design (Prabhu and Pe’er, 2009) (the latter is termed Overlap Log in this study). DNA Sudoku allocates samples to pools (i.e. constructs a design matrix) based on the Chinese remainder theorem (CRT) (Ding *et al.*, 1996) and identifies variant carriers with the help of combinatorial group testing theory (Du and Hwang, 2000). However, DNA Sudoku has several limitations. First, it assumes that all variant loci are known a priori and, therefore, cannot be used to detect novel rare variants. Second, its SNP calling algorithm, which is based on the ratio of counts of variant alleles to reference alleles, does not take sequencing errors into consideration, which can significantly affect calling results for rare variants. Finally, it does not provide variant allele frequency (VAF) estimation, which is commonly used in testing associations for case–control studies. The Overlap Log (Prabhu and Pe’er, 2009) constructs its design matrix based on the binary representation of integers (sample identification numbers).
Although such a design is ‘optimal’ in terms of the number of pools required, it cannot effectively identify variant carriers, mainly because it allocates too many individuals to each pool.

To address these issues, we propose a complete data analysis framework for overlapping pool designs, which consists of several steps. The framework, termed VIP for Variant Identification by Pooling, is very flexible and can be combined with any pool design approach and any sequence mapping/alignment tool. Our major contributions include algorithms for variant pool and variant locus identification, VAF estimation and decoding of variant samples (Fig. 1). To identify variants in pooled samples, we propose a log likelihood ratio statistic to estimate the variant allele ratio (VAR) in each pool. Theoretical analysis further reveals the relationship between the statistical power of the test to detect variants of certain frequencies and the sequencing coverage (i.e. read depth) and error rate. Variant loci can then be declared based on variant pool identification. The VAF of each variant locus can consequently be estimated based on a weighted average of the estimated VARs of all pools. Finally, a set of algorithms is proposed to identify variant carriers. To evaluate the effectiveness of VIP, we have performed extensive experiments by simulating a targeted-sequencing project with different numbers of samples (500 and 1000), different design strategies and different per-sample sequence coverage. In comparison with the two existing approaches (i.e. DNA Sudoku and Overlap Log), VIP makes significant improvements in all steps. For example, for variant pool and variant locus identification, F-scores (the harmonic mean of precision and recall) have been increased from 0.03–0.33 to 0.94–0.99 across datasets. In particular, VIP Sudoku (VIP in combination with Sudoku design matrixes) achieves the best overall results.
To further evaluate the performance of VIP under more realistic settings, we have created another dataset by pooling real sequence data from the 1000 Genomes Project (Thousand Genomes Project Consortium, 2010) using the Sudoku design and compared its performance with a more recent algorithm, Syzygy (Calvo *et al.*, 2010; Rivas *et al.*, 2011; Rivas and Daly, 2011). Syzygy was developed for the non-overlapping pool design and has been used in real sequencing projects (Calvo *et al.*, 2010; Rivas *et al.*, 2011). It can detect variant pools/loci and estimate allele frequencies, among other functionalities. However, Syzygy was not designed for overlapping pool strategies, and it cannot directly decode variant carriers. To apply Syzygy to this new dataset, we have combined it with some of the strategies in VIP for allele frequency estimation from overlapping pools and for variant carrier decoding. The modified program, termed VIP_Syzygy, was then compared with VIP on this dataset. Results show that VIP has better performance than VIP_Syzygy in all tasks performed, including variant pool/variant locus detection, allele frequency estimation and variant carrier identification.

The first step in using overlapping pool strategies is to construct a design matrix, specifying which pool contains which sample. In this subsection, we first introduce some notation and then discuss two existing pool design strategies, based on DNA Sudoku and Overlap Log, which are also used in this study. A pool design can be represented by a design matrix *M* = (*M _{ij}*), *M _{ij}* ∈ {0, 1}, where *t* is the number of pools, *n* is the number of samples, and *M _{ij}* = 1 indicates that sample *j* is assigned to pool *i*. The *j*-th column of matrix *M* (denoted as *M _{j}*) represents the pools in which sample *j* participates.

The logarithm design (Prabhu and Pe’er, 2009) is based on binary representations of integers (sample IDs). More specifically, each integer from 1 to *n* can be represented by a bitword of size ⌈log_{2} *n*⌉. For sample *j*, the bitword column vector representation of *j* − 1 and that of *n* − *j* together form the *j*-th column of *M*. The total number of pools is thus equal to 2⌈log_{2} *n*⌉. The logarithm design is a balanced design with column weight ⌈log_{2} *n*⌉ and row weight *n*/2.
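As an illustration (our own sketch, not the authors’ code), the logarithm design matrix can be constructed directly from these binary representations; here for *n* = 16, which gives 2⌈log_{2} 16⌉ = 8 pools with row weight *n*/2 = 8:

```python
import math
import numpy as np

def log_design(n):
    # bitword length needed for sample IDs 0..n-1
    b = math.ceil(math.log2(n))
    M = np.zeros((2 * b, n), dtype=int)
    for j in range(1, n + 1):
        # bits of j-1 fill the top half of the column, bits of n-j the bottom half
        M[:b, j - 1] = [(j - 1) >> k & 1 for k in range(b)]
        M[b:, j - 1] = [(n - j) >> k & 1 for k in range(b)]
    return M

M = log_design(16)
print(M.shape)        # (8, 16)
print(M.sum(axis=1))  # every row weight equals n/2 = 8
```

For *n* = 16 the bits of *n* − *j* are exactly the complement of the bits of *j* − 1, so every column weight equals log_{2} *n* = 4, matching the balanced-design property stated above.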

The Sudoku design (Erlich *et al.*, 2009) is based on the CRT (Ding *et al.*, 1996). For *n* individuals, it first finds integers *x*_{1}, *x*_{2}, …, *x _{w}*, such that the product of the two smallest ones is at least *n*, and *x*_{1}, …, *x _{w}* are pairwise co-prime. It then creates *w* groups of pools. The group corresponding to *x _{i}* has *x _{i}* pools, and sample *j* is assigned to the pool indexed by *j* mod *x _{i}* within that group.

An example to illustrate the Sudoku design and our decoding strategies. The design matrix is for 15 samples with nine pools. It has two groups of pools (*w* = 2) with *x*_{1} = 4 (for pools *a–d*) and *x*_{2} = 5 (for pools *A–E*). Samples 1, 5, 9 and **...**
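The figure’s design (15 samples, nine pools, with *x*_{1} = 4 and *x*_{2} = 5) can be reproduced with a short sketch of the CRT construction; the code below is our own illustration, assuming sample *j* joins pool (*j* − 1) mod *x _{i}* within group *i*:

```python
import numpy as np

def crt_design(n, moduli):
    # one group of x pools per modulus x; sample j (1-based)
    # joins pool (j - 1) % x of that group
    M = np.zeros((sum(moduli), n), dtype=int)
    row = 0
    for x in moduli:
        for j in range(1, n + 1):
            M[row + (j - 1) % x, j - 1] = 1
        row += x
    return M

M = crt_design(15, [4, 5])
print(M.shape)                    # (9, 15)
print(1 + np.flatnonzero(M[0]))   # members of pool 'a': samples 1, 5, 9, 13
```

Each sample lands in exactly *w* = 2 pools (its column weight), and the members of the first pool agree with the samples listed in the figure caption.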

Once a design matrix is given, DNA from all samples is pooled according to the design matrix and sequenced. Sequence reads can be aligned using existing mapping tools [e.g. MAQ (Li *et al.*, 2008)], and the observed results for each pool at a specific locus are the number of reads with the reference allele and the number of reads with variant alleles, as well as read mapping quality scores and base quality scores, which serve as the input data for subsequent analysis.

At each sequence position/locus, we consider whether a pool contains any samples with variant alleles. For this study, we assume there are at most two possible alleles/nucleotides at each locus. Let *R*_{1}* _{i}* denote the number of reads with the variant allele and *R*_{0}* _{i}* the number of reads with the reference allele in pool *i*; the locus coverage is *N _{i}* = *R*_{1}* _{i}* + *R*_{0}* _{i}*. We test the null hypothesis *H*_{0} that pool *i* contains no variant alleles (VAR μ* _{i}* = 0) against the alternative *H*_{1} that μ* _{i}* > 0.

Assume that each allele from an individual in a pool is independently sampled with an equal probability; then all the reads are independently, identically distributed, with the probability of a read having a variant allele under *H*_{0} being ν* _{i}* = *ER* (the sequencing error rate) and under *H*_{1} being ν* _{i}* = μ* _{i}*(1 − *ER*) + (1 − μ* _{i}*)*ER*. The log likelihood ratio test (LRT) statistic is

Λ* _{i}* = −2 log{[*ER*^{*R*1*i*}(1 − *ER*)^{*R*0*i*}] / [max_{μ*i*} ν* _{i}*^{*R*1*i*}(1 − ν* _{i}*)^{*R*0*i*}]}

(1)

Under *H*_{0}, because *ER* is assumed known, there are no free parameters, and the maximum value in the numerator of Equation (1) equals *ER*^{*R*1*i*}(1 − *ER*)^{*R*0*i*}. Under *H*_{1}, there is only one free parameter, μ* _{i}*. Its maximum likelihood estimate is

μ̂* _{i}* = (*R*_{1}* _{i}*/*N _{i}* − *ER*) / (1 − 2*ER*)

(2)

When the number of reads is large enough (*N _{i}* → ∞), the log likelihood ratio statistic Λ* _{i}* follows a χ^{2} distribution with one degree of freedom (Mood *et al.*, 1973). Pool *i* is regarded as a variant pool if *H*_{0} is rejected at a given significance level, and the locus can therefore be regarded as a variant locus. In our analysis, because for each locus we need to perform the hypothesis test tens to hundreds of times, depending on the number of pools, we use the false discovery rate (FDR) (Benjamini *et al.*, 2001) for multiple testing corrections.

Because the number of alleles in a pool is a discrete value, one can also estimate μ* _{i}* as a discrete variable. Suppose the number of samples in pool *i* is *n _{i}*; then μ* _{i}* can only take values in {0, 1/(2*n _{i}*), 2/(2*n _{i}*), …, 1}, and its estimate can be obtained by maximizing the likelihood over this discrete set.
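As a minimal, self-contained sketch of the test (our illustration; `er` denotes the assumed known error rate *ER*), one can plug the maximum likelihood estimate into the likelihood ratio and compare against the χ^{2}(1) critical value:

```python
import math

def lrt_variant_pool(r1, r0, er):
    # r1/r0: variant/reference read counts in a pool; er: sequencing error rate
    n = r1 + r0
    # probability of a variant read, clipped so the VAR estimate stays in [0, 1]
    nu = min(max(r1 / n, er), 1 - er)
    ll0 = r1 * math.log(er) + r0 * math.log(1 - er)   # H0: errors only
    ll1 = r1 * math.log(nu) + r0 * math.log(1 - nu)   # H1 at the MLE
    lam = -2 * (ll0 - ll1)
    mu_hat = (nu - er) / (1 - 2 * er)                 # MLE of the VAR
    return lam, mu_hat

# 12 variant reads out of 200 with error rate 0.005: clearly a variant pool
lam, mu_hat = lrt_variant_pool(12, 188, 0.005)
print(lam > 3.84)         # 3.84 is roughly the chi-square(1) critical value at alpha = 0.05
print(round(mu_hat, 3))   # 0.056
```

With only 1 variant read out of 200 (consistent with sequencing error alone), the same statistic stays below the critical value and the pool would not be called.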

For each locus, the asymptotic power (1 − β) of the LRT statistic Λ* _{i}* can be derived theoretically as a function of the significance level α, the error rate *ER*, the VAR μ* _{i}* and the locus coverage *N _{i}*:

1 − β = Pr[χ^{2}_{1}(δ) > χ^{2}_{1,1−α}]

(3)

where χ^{2}_{1,1−α} is the critical value corresponding to α under *H*_{0}, and χ^{2}_{1}(δ) is the non-central χ^{2} distribution with 1 degree of freedom. The non-centrality parameter δ can be represented as a function of the locus coverage *N _{i}*, the error rate *ER* and the VAR μ* _{i}*; it is proportional to *N _{i}*, i.e. δ = *N _{i}*δ_{0}, where δ_{0} depends only on *ER* and μ* _{i}*.

To obtain the required locus coverage *N _{i}* to achieve a desired power of 1 − β under the alternative *H*_{1}, one can solve

χ^{2}_{1,1−α} = χ^{2}_{1,β}(δ)

(4)

where χ^{2}_{1,β}(δ) is the β percentile of the non-central χ^{2} distribution with one degree of freedom and non-centrality parameter δ. Given β and α, δ can be numerically calculated using existing software packages to satisfy Equation (4). The value of *N _{i}* can then be readily obtained as *N _{i}* = δ/δ_{0}.
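The power relationship can also be sanity-checked by direct Monte Carlo simulation, avoiding the non-central χ^{2} computation altogether; the following is our own illustrative sketch:

```python
import math
import random

def lrt_stat(r1, n, er):
    # LRT statistic for a pool with r1 variant reads out of n, error rate er
    r0 = n - r1
    nu = min(max(r1 / n, er), 1 - er)
    ll0 = r1 * math.log(er) + r0 * math.log(1 - er)
    ll1 = r1 * math.log(nu) + r0 * math.log(1 - nu)
    return -2 * (ll0 - ll1)

def empirical_power(mu, er, coverage, crit=3.84, trials=2000, seed=7):
    # fraction of simulated pools (true VAR = mu) in which H0 is rejected;
    # each read shows the variant allele with probability mu*(1-er) + (1-mu)*er
    rng = random.Random(seed)
    nu = mu * (1 - er) + (1 - mu) * er
    rejections = 0
    for _ in range(trials):
        r1 = sum(rng.random() < nu for _ in range(coverage))
        if lrt_stat(r1, coverage, er) > crit:
            rejections += 1
    return rejections / trials

# a pool with VAR 0.05 at 400x locus coverage is detected essentially always
print(empirical_power(0.05, 0.005, 400) > 0.99)
```

Running the same function with μ = 0 gives the empirical type-I error, which stays near or below the nominal α — consistent with the asymptotic result above.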

For real sequence data, it is difficult to estimate the sequencing error rate accurately. We propose to use read base quality scores provided by the sequencing data to approximate locus-specific error rates. More specifically, for each locus and for pool *i*, let *Q _{j}* denote the base quality score of read *j* at the locus, so that the error probability of its base call is *e _{j}* = 10^{−*Q _{j}*/10}. The likelihood then becomes a product over individual reads:

*L*(μ* _{i}*) = ∏* _{j}* ν_{*j*,*i*}^{*z _{j}*}(1 − ν_{*j*,*i*})^{1−*z _{j}*}

(5)

where ν_{*j*,*i*} is the probability of read *j* having a variant allele, ν_{*j*,*i*} = μ* _{i}*(1 − *e _{j}*) + (1 − μ* _{i}*)*e _{j}*, and *z _{j}* = 1 if read *j* carries the variant allele and 0 otherwise.
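To illustrate the idea, here is a hedged sketch (variable names are ours) of a per-read likelihood built from Phred qualities, with the maximum found by a simple grid search rather than the estimator used in VIP:

```python
import math

def loglik(mu, variant_flags, errors):
    # log likelihood of VAR mu given per-read variant calls and error rates
    ll = 0.0
    for z, e in zip(variant_flags, errors):
        nu = mu * (1 - e) + (1 - mu) * e   # P(read shows the variant allele)
        ll += math.log(nu if z else 1 - nu)
    return ll

quals = [30, 20, 35, 30, 28]                # Phred base qualities
errors = [10 ** (-q / 10) for q in quals]   # per-read error probabilities e_j
flags = [1, 0, 0, 1, 0]                     # which reads show the variant

# grid-search MLE of the VAR
mu_hat = max((k / 100 for k in range(101)), key=lambda m: loglik(m, flags, errors))
print(mu_hat)   # 0.4 -- two of five reads carry the variant and errors are small
```

Because the error probabilities are small here, the MLE is close to the raw variant-read fraction; larger *e _{j}* values pull the estimate toward zero, which is the point of weighting by quality.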

In many downstream applications such as association analysis, allele frequency is one important measure that is commonly used to compare different groups (e.g. cases versus control subjects). It is therefore desirable to reliably estimate allele frequencies from pooling designs. A simple approach to obtain the frequencies of variant alleles is to take advantage of the estimated fraction of variant alleles in the identified variant pools at each variant locus (i.e. μ* _{i}* in Equation 2). Given *n _{i}* individuals in pool *i*, the estimated number of variant alleles in the pool is 2*n _{i}*μ̂* _{i}*; the VAF of the locus can then be estimated as a weighted average of the pool-level VARs, with weights proportional to the pool sizes (non-variant pools contributing a VAR of zero).
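One plausible implementation of such a weighted average (our sketch; weights taken as the numbers of individuals per pool, with pools not called as variant pools contributing zero) is:

```python
def estimate_vaf(pool_sizes, pool_vars):
    # VAF = size-weighted average of pool-level VAR estimates;
    # pools not called as variant pools enter with a VAR of 0.0
    total = sum(pool_sizes)
    return sum(n * v for n, v in zip(pool_sizes, pool_vars)) / total

# three pools of 10, 10 and 20 individuals; only two are variant pools
print(estimate_vaf([10, 10, 20], [0.05, 0.0, 0.025]))  # 0.025
```

With overlapping designs each individual contributes to several pools, so averaging over pools rather than summing avoids double-counting the same alleles.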

Once a site is predicted as a variant locus, the next step is to predict which samples carry variant alleles at this site. Pooling designs in principle are (only) effective for variant sample prediction when variant allele frequencies are small, which is usually the case in studies that seek rare causal disease variants. We discuss this problem in two settings: in the ideal case, all the variant pools are correctly called; in practice, mistakes in variant pool calling (false positives and false negatives) do occur and need special treatment. For the case with no errors, DNA Sudoku (Erlich *et al.*, 2009) has proposed pattern consistent calling based on the *d*-disjunct theory (Kautz and Singleton, 1964), denoted as strategy S1 here. It states that if the number of individuals with variants is smaller than or equal to the decoding robustness (*d*) (Erlich *et al.*, 2009; Kautz and Singleton, 1964), then a sample carries a variant if and only if all the pools that include this sample as a member are variant pools. When the number of individuals with variants is greater than the decoding robustness, the condition becomes necessary but not sufficient; that is, if a sample carries a variant, all of its pools must be variant pools, but a sample may not carry any variant even if all of its pools are variant pools. In practice, one does not know the number of samples with variants at a locus. One strategy to estimate this number is to use the largest number of samples with all their pools being variant pools. The set of all such samples (denoted as *B*) is a superset of all variant samples. Based on *B*, we propose another strategy, denoted as S2: if there is a variant pool that contains only one sample from *B*, then the variant allele(s) must come from that sample, so that sample must carry variant allele(s). Figure 2 provides an example illustrating decoding strategies S1 and S2.
The effectiveness of the aforementioned two strategies for variant sample identification critically depends on the accuracy of variant pool calling. When there are false negatives in pool identification, it is possible that at some loci, the number of variant pools is less than the minimum of column weights, and some variant pools have no identified variant samples (using S1 and S2) associated with them. We propose two boosting algorithms to address these issues (details of the two algorithms can be found in Supplementary Section 1.2).
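Strategies S1 and S2 can be sketched in a few lines on the small Sudoku design of Figure 2 (our own illustrative code; the CRT construction and pool indexing are assumptions for illustration):

```python
import numpy as np

def crt_design(n, moduli):
    # CRT-style matrix: sample j (1-based) joins pool (j - 1) % x in each group
    M = np.zeros((sum(moduli), n), dtype=int)
    row = 0
    for x in moduli:
        for j in range(1, n + 1):
            M[row + (j - 1) % x, j - 1] = 1
        row += x
    return M

def decode(M, variant_pools):
    vp = set(variant_pools)
    # S1: candidate set B = samples ALL of whose pools are variant pools
    B = [j for j in range(M.shape[1]) if set(np.flatnonzero(M[:, j])) <= vp]
    # S2: a variant pool containing exactly one candidate pins that sample down
    confirmed = set()
    for i in vp:
        members = [j for j in B if M[i, j] == 1]
        if len(members) == 1:
            confirmed.add(members[0])
    return B, sorted(confirmed)

M = crt_design(15, [4, 5])
# suppose only sample 3 (0-based index 2) carries the variant:
variant_pools = list(np.flatnonzero(M[:, 2]))
B, confirmed = decode(M, variant_pools)
print(B, confirmed)   # [2] [2]
```

With a single variant carrier, both of its pools are variant pools, no other sample's pool signature fits inside them, and S2 confirms the carrier from either pool.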

To generate dataset I, we simulate a targeted sequencing project of a region of size 100 kbp with different coverage (10–30× per individual in each pool). For convenience, we randomly choose one region (Chr7: 27 124 046–27 224 045) from the ENCODE project (Birney *et al.*, 2007), because ENCODE regions have been extensively studied and information about SNP variants (including rare SNPs) is more reliable in those regions. The reference sequence and SNP information in this region are obtained from NCBI’s databases. SNP VAFs are either directly available or can be derived from SNP heterozygosity information. Genotypes of each individual are generated based on VAFs assuming Hardy–Weinberg equilibrium. Once the two homologous sequences of an individual are generated, they are randomly sheared into short reads of fixed length. This is done independently for different pools. The number of reads is determined by the expected coverage per sample. To be more realistic, we also randomly introduce sequencing errors into the reads at a fixed rate of 0.005. The short reads are then aligned to the reference sequence using MAQ (Li *et al.*, 2008).
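The Hardy–Weinberg genotype-generation step can be sketched as follows (our illustration; each individual’s two alleles are drawn independently with probability equal to the VAF):

```python
import random

def genotype(vaf, rng):
    # number of variant alleles (0, 1 or 2) for one individual under
    # Hardy-Weinberg equilibrium: two independent draws with P(variant) = vaf
    return int(rng.random() < vaf) + int(rng.random() < vaf)

rng = random.Random(42)
genotypes = [genotype(0.01, rng) for _ in range(20000)]
carrier_rate = sum(g > 0 for g in genotypes) / len(genotypes)
# expected carrier frequency: 1 - (1 - 0.01)^2 = 0.0199
print(abs(carrier_rate - 0.0199) < 0.005)   # True
```

This makes explicit why pooling helps for rare variants: at VAF 0.01, only ~2% of individuals carry the allele, so most pools contain few or no variant alleles.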

To generate dataset II, we use a subset of real sequence data from the Pilot 3 project of the 1000 Genomes Project (Thousand Genomes Project Consortium, 2010), which performed high-coverage (56× on average) sequencing of 697 samples from seven populations on 8140 exons from 906 randomly selected genes, carried out by multiple institutions using multiple platforms. We only use data generated on Illumina platforms with genotypes called by one institution (i.e. Boston College), which limits the number of samples to 364. We select the largest 41 exons (≥1600 bp) on chromosome 11, with a total length of roughly 100 kbp. Only reads within these regions are retained. Reads from potential PCR duplicates are removed by applying Picard MarkDuplicates (http://picard.sourceforge.net). Finally, only the top 300 samples with the largest numbers of reads are used in generating dataset II. The genotypes called from individual samples by the 1000 Genomes Project team (i.e. Boston College for these samples) are regarded as the ground truth.

We focus only on the Sudoku design in generating dataset II because it provides better decoding capacity than the logarithm design. When choosing a column weight of six for the 300 samples, the Sudoku design yields a matrix of 145 pools. To generate reads for each pool, we randomly sample reads from each individual assigned to the pool. By doing so, local coverage variation is expected to mimic real sequence data. The total number of reads sampled from each individual in a pool is determined by the total throughput of the pool, assuming that all the individuals in the pool contribute equal amounts of reads. The total throughput of the design is determined based on a specified expected individual coverage and is then averaged across pools (Supplementary Section 1.3). Reads for different pools are generated independently. Therefore, when an individual participates in multiple pools, different reads from that individual may be assigned to different pools. Data generated by this process are more similar to real data than dataset I and have many features that mimic real data.

For dataset I, we have created two groups of samples: one with 500 individuals (dataset I.1) and the other with 1000 individuals (dataset I.2). For dataset I.1, there are a total of 1685 variant loci, about half of which (804) can be regarded as rare SNPs (sample VAF ≤ 0.05). The smallest VAF is 0.001 and the largest is 0.537. For dataset I.2, there are a total of 1694 variant loci, 803 of which have sample VAF ≤ 0.05. The smallest VAF is 0.0005 and the largest is 0.526.

To evaluate the performance of VIP on dataset I, we adopt two different pool design strategies (i.e. the Sudoku design and the logarithm design, denoted as VIP Sudoku and VIP Log) and compare the results with the methods in the original articles [denoted as DNA Sudoku (Erlich *et al.*, 2009) and Overlap Log (Prabhu and Pe’er, 2009)]. Because their code is not available, we implement DNA Sudoku and Overlap Log according to the descriptions in their corresponding articles, adopting the same sets of parameters as in those articles. For dataset I.1, we use a fixed per-individual coverage (10×), and for dataset I.2, we use two different coverages (10 and 30×). For dataset I.1, the logarithm design creates 18 pools, and its column weight equals nine. By choosing a column weight of seven, the Sudoku design yields 210 pools. For dataset I.2, the logarithm design creates 20 pools, and its column weight equals 10. For the Sudoku design, we use the same column weight of seven, which yields 270 pools.

For dataset II, the total number of SNPs called by Boston College is 783, and 679 of them have VAF ≤ 0.05. The smallest sample VAF is 0.0017 and the largest is 1.0000. Only the Sudoku design is used. By choosing a column weight of six, the Sudoku design yields 145 pools. We use a fixed per-individual coverage that is roughly the same as in the original data (55×). Based on the results on dataset I, we only compare VIP and VIP_Syzygy on this dataset. To incorporate base quality scores via Equation (5), we use the scores recalibrated by the program GATK (DePristo *et al.*, 2011; McKenna *et al.*, 2010), which are expected to be more reliable than the original ones. We further extract allele count information using SAMtools’ mpileup utility (Li and Handsaker, 2009), with a base quality score of 10 and a mapping quality score of 20 as filters. We use the same filtering parameters for Syzygy for a fair comparison and keep all its other parameters at their defaults.

Because variant pool identification is the foundation of the proposed framework, in that the identification of variant loci is a direct application of pool identification and variant sample decoding relies heavily on the prediction accuracy of variant pools, we first evaluate the performance of the proposed log-LRT in variant pool identification. Although VAF affects the detection of variant pools, a more direct factor is the VAR in each pool, which is determined by both the VAF and the number of individuals in each pool. Furthermore, the power of the LRT also depends on the sequencing coverage and error rate, as illustrated in Section 2.2.2. Because the theory is based on asymptotic analysis, in this subsection we use *in silico* experiments to examine how well the theoretical relationship between power and the other factors holds in practice. Because the accuracy of the LRT for pool identification is not affected by pool designs, for this experiment we use three non-overlapping pool datasets based on dataset I.1, with 25, 50 and 125 pools, respectively. The numbers of individuals in each pool are 20, 10 and 4, respectively, and the minimum VARs in each pool are 0.025, 0.05 and 0.125, respectively. In general, results show that the experimental and theoretical power of the LRT are consistent for VAR ≥ 0.05. We also observe that for VAR ≤ 0.05 and a fixed coverage, the practical power is much higher than the theoretical one, indicating that the theoretical prediction may be somewhat conservative. Owing to space limitations, more details can be found in Supplementary Section 2.1.

In this and the next two subsections, we discuss the performance of VIP Sudoku, VIP Log, DNA Sudoku and Overlap Log on dataset I, and the performance of VIP Sudoku and VIP_Syzygy Sudoku on dataset II, in terms of variant pool/locus identification, VAF estimation and variant sample identification.

Table 1 summarizes the results of variant pool and variant locus identification on datasets I and II. For dataset I, because the two designs (Sudoku and logarithm) differ greatly in the number of pools, the total numbers of pools with variants across all loci also differ greatly (from 30–33 K for data using the logarithm design to ~250–354 K for data using the Sudoku design). Regardless of design strategy, sample size or coverage, VIP achieves very high precisions (0.9998–1) and very high recalls (0.8925–0.9907). In contrast, DNA Sudoku is very conservative in calling variant pools. Although its precision is always one, it has very low recalls (0.1893–0.2). Such low sensitivity in detecting variant pools directly results in its low sensitivity in detecting variant loci and identifying variant carriers (to be discussed later). Conversely, Overlap Log has very high recalls (0.9825–0.991) but very low precisions (0.0164–0.017); essentially, for almost every locus, Overlap Log calls every pool a variant pool, which creates many false positives when identifying variant loci and variant carriers. The comparable performance of VIP on the Sudoku and logarithm designs shows that, with proper algorithms, variant pool identification is not affected by the design, as long as the per-sample coverage is kept the same. With higher coverage, the recall of variant pool identification by VIP Sudoku increases significantly; the increase for VIP Log is moderate, whereas little improvement is observed for DNA Sudoku or Overlap Log. For locus identification, similar trends exist for all the approaches. In particular, VIP Sudoku achieves the best overall results, with F-scores ≥0.99 in all cases (Table 1).

For dataset II, Table 1 shows that VIP performs better than Syzygy in terms of both precision and recall on variant pool identification, as well as on variant locus detection. The difference in F-scores for variant pool identification is ~5% (0.8366 versus 0.7855), and the difference is >20% for variant locus identification (0.8531 versus 0.6287). On the other hand, the performance of VIP on dataset II is not as good as its performance on dataset I, mainly because of the variability of locus coverage in dataset II (Supplementary Fig. S4A). We therefore examine the variation of VIP’s performance for different coverages and different VARs by grouping the pools according to these two measures and calculating the F-score of variant pool identification for each group. Results (Supplementary Fig. S4B) show that VIP has lower F-scores for pools that either have very low coverage or very low VARs.

Because DNA Sudoku does not provide a component for estimating variant allele frequencies, we show the comparison of VIP Log, Overlap Log and VIP Sudoku on estimating VAFs of variant loci on dataset I (Fig. 3A and B), and of VIP Sudoku and VIP_Syzygy Sudoku on dataset II (Fig. 3C). Overall, all the approaches are highly accurate in estimating VAFs on both datasets, with the *r*^{2} between the estimated and real VAFs >0.99 on dataset I, and *r*^{2} of 0.958 and 0.888 for VIP Sudoku and VIP_Syzygy Sudoku, respectively, on dataset II. On dataset I, all the approaches have extremely high accuracy when VAF ≥ 0.01. For loci with smaller frequencies, the variance gets somewhat larger relative to the frequencies themselves, which is common for statistical estimation of small values. Also, for loci with small VAFs, the VAFs estimated by VIP Log and VIP Sudoku are less than the real ones, whereas those estimated by Overlap Log are greater than the real ones. This can be explained by the conservative variant pool identification of VIP, whereas Overlap Log reports many more false positives on loci with low VAFs. On dataset II, the performance of VIP Sudoku is a little worse than its own performance on dataset I, mainly because dataset II contains a higher level of noise. Nevertheless, VIP still outperforms Syzygy and has a better *r*^{2}.

To evaluate the performance of different approaches on variant sample identification, we only consider the true variant loci. Because VIP Sudoku and VIP Log recall almost all variant loci with high precision on dataset I, this restriction does not affect the VIP algorithms much. It does not affect DNA Sudoku, nor does it affect VIP or Syzygy much on dataset II. On the contrary, this treatment favours Overlap Log, because the majority of false positives from miscalled variant loci are ignored. Also, only results using the decoding strategy S1 in combination with the two boosting algorithms are presented in the main text; the two boosting algorithms greatly enhance the decoding effectiveness of strategy S1. Owing to space limitations, the results using strategy S2 on top of the results from strategy S1 are provided in Supplementary Material. Table 2 shows the precisions, recalls and F-scores of variant sample detection on variant loci of different VAFs. For dataset I, the three algorithms (VIP Sudoku, VIP Log and Overlap Log) have almost the same performance when VAF ≥ 0.2. In such cases (high VAFs), these algorithms call a significant majority of samples as variant carriers; therefore their recalls reach 1.0, whereas their precisions are only ~0.63. The only algorithm that behaves differently is DNA Sudoku, whose recalls are ~0.50–0.59 with precisions from 0.77 to 0.83. This is a consequence of DNA Sudoku's conservative calling of variant pools. For the same reason, DNA Sudoku cannot identify any variant samples when VAF ≤ 0.2. For VAFs in the range (0.05, 0.20), VIP Log and Overlap Log obtain the same results, although VIP Log has much better performance in variant pool identification. This shows that the logarithm design is not effective for variant sample identification: it cannot correctly predict variant samples even when almost all variant pools are correctly predicted, mainly because the number of individuals in each pool is too large. In the same range, VIP Sudoku achieves comparable recall with the highest precision. For VAF in (0, 0.05), VIP Log and Overlap Log have good recalls but extremely low precisions. In contrast, VIP Sudoku has a much more balanced recall and precision. When the sample size increases from 500 to 1000, the trends for the different algorithms differ across VAF intervals. For example, for VAFs in (0, 0.05), both the precision and the recall of VIP Sudoku decrease as the sample size increases. This is mainly because, for the same VAF, the absolute number of variant samples is larger for a larger sample size, which makes decoding harder. This point is further illustrated in Figure 4, where we examine the performance of the algorithms for rare variants at a finer scale. Keeping the sample size the same and increasing the coverage from 10× to 30×, the recalls improve in most cases, but the precisions are slightly worse or unchanged.
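The precision, recall and F-scores discussed above follow the standard definitions over predicted versus true carrier sets. For reference, a minimal computation (the sample identifiers in the usage note are purely illustrative):

```python
def carrier_metrics(predicted, truth):
    """Precision, recall and F-score of a set of predicted variant
    carriers against the true carrier set at one locus."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # correctly identified carriers
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

For example, calling samples {1, 2, 3, 4} as carriers when the true carriers are {1, 2} gives recall 1.0 but precision 0.5, mirroring the high-VAF behaviour described above, where calling a majority of samples as carriers drives recall to 1.0 at the cost of precision.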

Variant sample identification by different algorithms on rare variant loci. The subgraphs represent the precisions (top panel) and recalls (bottom panel) for 500 samples with coverage of 10× (A and E), 1000 samples with coverage of 10× (B and **...**

To further examine the performance of different algorithms on variant sample identification for rare variants, Figure 4 shows the precisions and recalls on loci with VAF < 0.05 at a higher resolution. From the figure, VIP Sudoku achieves much higher precision than any other method on dataset I. VIP Log and Overlap Log have similar performance in most cases. In general, they have higher recalls when VAF ≥ 0.02, but their precisions are very low. For VAF = 0.01, where the number of samples with variants is more likely to be smaller than the decoding robustness, VIP Sudoku also has higher recalls than VIP Log and Overlap Log (except in the case of 1000 samples with 10× coverage). For VIP Sudoku, both precisions and recalls decrease when the sample size increases from 500 to 1000. This again reflects the fact that, for the same VAF, the number of samples with variants is larger for a larger sample size. When the coverage increases from 10× to 30×, the recalls of VIP Sudoku increase with comparable precisions. Also, the recalls of VIP Sudoku get lower as VAF increases beyond 0.02. Overall, VIP Sudoku performs the best in variant sample detection for dataset I and has the highest F-scores in all cases. In particular, it is well suited to decoding rare variant carriers. Precision and recall of VIP, DNA Sudoku and Overlap Log on dataset I at higher resolution can be found in Supplementary Figures S5 and S6.

For dataset II, results in Table 2 show that VIP Sudoku performs better than VIP_Syzygy Sudoku in identifying variant samples for all VAF intervals in terms of F-scores. For loci with VAF > 0.05, VIP and VIP_Syzygy have almost the same precisions, but VIP always has better recalls. For loci with VAF ≤ 0.05, VIP has a better recall (0.425 versus 0.311), but VIP_Syzygy has a better precision (0.951 versus 0.771). We further examine the precisions and recalls of the two approaches on variant loci with VAF ≤ 0.05 at higher resolution in Figure 4D and H and Supplementary Figure S5D and H. Results show that VIP has better recalls on all the intervals and comparable precisions for 0.005 < VAF ≤ 0.05. The only interval in which VIP has lower precision is VAF ≤ 0.005, in which case there are at most three variant alleles in the samples.

In this article, we have presented VIP, a complete data analysis package mainly for rare SNP discovery and calling from data generated by overlapping pool designs. VIP consists of several algorithms, including variant pool and variant locus identification, VAF estimation and variant sample detection, and it can be combined with any design matrix. For the proposed LRT for variant pool identification, we also investigated the quantitative relationship between power, coverage and VAF, for a given significance level and sequencing error rate. We have performed extensive experiments on simulated data and on a dataset generated from real sequence data from the 1000 Genomes Project to demonstrate the effectiveness of VIP. In summary, VIP significantly outperforms all three approaches considered in this study and is very effective in identifying variant pools and variant loci, and in estimating VAFs. Furthermore, in combination with the Sudoku design, VIP Sudoku is also more effective in identifying variant samples than the other approaches.
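For readers unfamiliar with Chinese-remainder-theorem-based pooling, the general construction behind the Sudoku design (Erlich *et al.*, 2009) assigns sample *i* to pool (*i* mod *m*) for each modulus *m* in a set of pairwise coprime pool sizes. The sketch below illustrates this general idea only; the moduli are arbitrary choices, not the parameters used in our experiments.

```python
def crt_design(n_samples, moduli=(7, 11, 13)):
    """Build a binary pooling design matrix based on the Chinese
    remainder theorem: for each modulus m, sample i joins pool (i mod m).

    With pairwise coprime moduli whose product exceeds n_samples, the
    pool-membership pattern of any single sample is unique, so a lone
    variant carrier can be decoded from the set of variant pools.
    Returns a (sum(moduli) x n_samples) 0/1 matrix as nested lists.
    """
    n_pools = sum(moduli)
    design = [[0] * n_samples for _ in range(n_pools)]
    row_offset = 0
    for m in moduli:
        for i in range(n_samples):
            design[row_offset + (i % m)][i] = 1  # sample i -> pool (i mod m)
        row_offset += m
    return design
```

With moduli (7, 11, 13), 100 samples are covered by only 31 pools, each sample appearing in exactly 3 of them, and every sample has a distinct pool pattern since 7 × 11 × 13 = 1001 > 100.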

In terms of cost, pooling is in most cases cheaper than no pooling, mostly because of savings in target capturing (see Supplementary Section 2.5 for more discussion). Of the two designs tested, the logarithm design has the lowest overall cost, but it is not effective in identifying variant samples. Another possible strategy is to combine the logarithm design with customized SNP arrays to type the samples after rare SNPs are discovered. The relative merit of such a strategy will be explored and compared with the VIP Sudoku algorithm in future studies. Beyond overlapping pool designs, the frequency estimation algorithm in VIP can be used in other biological applications. For example, viruses in patients naturally consist of many different strains/species, and it is important to examine the population diversity of viruses in patients and its correlation with treatment effectiveness. Many SNP mutations in virus populations have small frequencies; one can use the proposed algorithm and its extensions to discover SNPs in viruses and estimate their frequencies.

In our derivation, we did not consider possible sampling bias generated from uneven amounts of DNA from different samples within a pool and across pools. Bias can also be generated from non-uniform amplification of DNA from different samples. We will take these possible biases into account in future work. In addition to SNPs, indels are another important type of genetic variation; indel identification from pooled samples will also be one of our future research directions.

One limitation of pooling designs is that they cannot identify carriers of more common SNPs. Recently, He *et al.* (2011) shed light on this issue: by taking advantage of linkage disequilibrium information and prior genotype information for some common SNPs from the same samples, one can potentially impute the genotypes of common SNPs. Therefore, genotypes of both rare SNPs and common SNPs can be recovered. We will work on improving the performance of common variant sample detection by incorporating external information into our framework.

The authors thank the reviewers for their constructive comments.

*Funding:* This research is supported by National Institutes of Health/National Library of Medicine grant LM008991.

*Conflict of Interest:* none declared.

- Altshuler DM, et al. A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
- Altshuler DM, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. [PMC free article] [PubMed]
- Bansal V, et al. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics. 2010;26:i318–i324. [PMC free article] [PubMed]
- Benjamini Y, et al. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001;29:1165–1188.
- Birney E, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. [PMC free article] [PubMed]
- Calvo SE, et al. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat. Genet. 2010;42:851–858. [PMC free article] [PubMed]
- DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. [PMC free article] [PubMed]
- Ding C, et al. Chinese Remainder Theorem: Applications in Computing, Coding, Cryptography. World Scientific, Singapore/River Edge, NJ; 1996.
- Druley TE, et al. Quantification of rare allelic variants from pooled genomic DNA. Nat. Methods. 2009;6:263–265. [PMC free article] [PubMed]
- Du D, Hwang F. Combinatorial Group Testing and its Applications. World Scientific, Singapore/River Edge, NJ; 2000.
- Efron B. Bootstrap methods: another look at the jackknife. Ann. Stat. 1979;7:1–26.
- Erlich Y, et al. DNA Sudoku-harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res. 2009;19:1243–1253. [PubMed]
- Frazer KA, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
- He D, et al. Genotyping common and rare variation using overlapping pool sequencing. BMC Bioinformatics. 2011;12(Suppl. 6): S2. [PMC free article] [PubMed]
- Kautz W, Singleton R. Nonrandom binary superimposed codes. IEEE Trans. Inf. Theory. 1964;10:363–377.
- Kim SY, et al. Design of association studies with pooled or un-pooled next-generation sequencing data. Genet. Epidemiol. 2010;34:479–491. [PubMed]
- Lee JS, et al. On optimal pooling designs to identify rare variants through massive resequencing. Genet. Epidemiol. 2011;35:139–147. [PMC free article] [PubMed]
- Li H, et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858. [PubMed]
- Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. [PMC free article] [PubMed]
- McCarthy MI, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 2008;9:356–369. [PubMed]
- McKenna A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. [PubMed]
- Momozawa Y, et al. Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory bowel disease. Nat. Genet. 2011;43:43–47. [PubMed]
- Mood AM, et al. Introduction to the Theory of Statistics. NY, USA: McGraw-Hill; 1973.
- Nejentsev S, et al. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–389. [PMC free article] [PubMed]
- Nijman IJ, et al. Mutation discovery by targeted genomic enrichment of multiplexed barcoded samples. Nat. Methods. 2010;7:913–967. [PubMed]
- Out AA, et al. Deep sequencing to reveal new variants in pooled DNA samples. Hum. Mutat. 2009;30:1703–1712. [PubMed]
- Prabhu S, Pe’er I. Overlapping pools for high-throughput targeted resequencing. Genome Res. 2009;19:1254–1261. [PubMed]
- Rivas MA, Daly MJ. 2011. Syzygy Documentation. Release 1.1.0. http://www.broadinstitute.org/software/syzygy/sites/default/files/Syzygy_2011.pdf (July 2012, date last accessed)
- Rivas MA, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 2011;43:1066–1073. [PMC free article] [PubMed]
- Li H, et al. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009;25:2078–2079. [PMC free article] [PubMed]
- Smith AM, et al. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 2010;38:e142. [PMC free article] [PubMed]
- The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
- Vallania FLM, et al. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 2010;20:1711–1718. [PubMed]
- Wang T, et al. Resequencing of pooled DNA for detecting disease associations with rare variants. Genet. Epidemiol. 2010;34:492–501. [PubMed]
- Wei Z, et al. SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res. 2011;39:e132. [PMC free article] [PubMed]
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**