Massively parallel sequencing has expanded the genomics era by dramatically reducing the cost and time of large-scale DNA sequencing. Although whole genome sequencing may soon become routine, it is often more practical in terms of cost, time, and labor to target specific regions of interest in the genome. Targeted genomic enrichment (TGE), also known as targeted sequence capture, allows efficient isolation of genomic regions prior to massively parallel sequencing [1]. Briefly, DNA libraries are hybridized with DNA or RNA oligonucleotides complementary to the regions of interest (baits), and the bait-library complexes are pulled out of solution after hybridization to generate an enriched library for sequencing. This method has generally been used to target exonic and splice-site sequences of the human genome, as ~85% of known mutations reside in these regions [2].
TGE has been used to discover mutations in sets of genes associated with specific diseases [3] or, in an unbiased way, by targeting the whole exome [2]. A trade-off exists between the number of base pairs targeted for sequencing and the throughput of sequencing with respect to cost and time. However, all types of TGE uncover disease-causing mutations more efficiently than whole genome sequencing.
The increase in sequencer output has far outpaced our ability to use the generated sequence efficiently. While it is clear that saturating levels of sequencing coverage are required for the lowest false positive and false negative rates in TGE experiments [6], it is also clear that returns diminish once the threshold coverage level for variant detection is exceeded and that over-coverage can, in fact, introduce errors [7]. Pooling samples is an attractive option to maximize sequencer output; however, due to the sequencer error rate, reliable differentiation of true positives from false positives is generally difficult [8] unless specialized software is employed [9]. To avoid this difficulty, molecular barcodes (indexes) can be ligated to sheared DNA fragments prior to pooling, a process called multiplexing. Because the purpose of TGE is to focus on relatively small genomic regions of interest, multiplexing can be used to maximize sequencer output.
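The role of the barcode in multiplexing can be illustrated with a minimal sketch that assigns pooled reads back to their samples of origin. It makes several simplifying assumptions that are not drawn from this study: the barcode is treated as an in-line prefix of each read, matching is exact with no mismatch tolerance, and the barcode sequences, sample names, and the `demultiplex` function are all hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group pooled reads by sample using an exact-match barcode prefix.

    Assumption: the barcode occupies the first `barcode_len` bases of each
    read. Reads whose prefix matches no known barcode go to "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 6-base barcodes and toy reads, for illustration only.
barcodes = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
pooled_reads = ["ACGTACGGATTC", "TGCATGCCTAAG", "ACGTACTTGACA", "NNNNNNAAAAAA"]

assignments = demultiplex(pooled_reads, barcodes, barcode_len=6)
for sample, inserts in sorted(assignments.items()):
    print(sample, inserts)
```

Because every fragment carries its sample's barcode, a variant observed in the pooled data can be traced to a single sample, which is what distinguishes multiplexing from unlabeled pooling.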
When applied to TGE, multiplexing can be performed prior to capture (pre-capture multiplexing) or after capture (post-capture multiplexing). The first protocols using molecular barcodes employed a post-capture approach [11], and post-capture multiplexing has since become the standard method for TGE. Several studies, however, have shown that the pre-capture approach is also feasible and that it offers three important advantages over post-capture multiplexing: 1) decreased cost, as the capture step is generally the most costly step in the TGE protocol; 2) reduced hands-on time, as samples are pooled earlier in the protocol; and 3) reduced cross-contamination risk through the earlier addition of barcodes [12].
However, the limits of pre-capture multiplexing have not been thoroughly tested. Of two pilot studies showing the feasibility of pre-capture multiplexing, one used 8 samples with single pools of 3 or 5 samples pooled pre-capture [12], while the other used 9 samples with single pools of either 3 or 9 samples pooled pre-capture [13]. Both of these studies included only 1 pool at each pool size, so neither a detailed comparison of the effects of increasing pool size nor a comparison to standard post-capture pooling was possible. In another study, pre-capture pools of 4, 6, and 12 samples were evaluated, and the authors went on to use 6 pools of 8 samples to sequence 48 samples [14]. Finally, in a further study the authors pooled 20 samples pre-capture; however, only a single pool was used, and TGE was microarray-based [15]. All of these studies showed the feasibility of pre-capture pooling for TGE; however, we sought to study the effects of pre-capture pooling on solution-phase TGE systematically, using different pool sizes and a large number of samples.
In general, post-capture multiplexing is used for TGE whereas pre-capture multiplexing is not. In this study, we sought to address several questions germane to multiplexing during TGE experiments, including the effect of inter-sample competition for capture baits during hybridization, the impact on capture efficiency, and the downstream effects on overall sequence quality as measured by read mapping and duplicate reads. To address these questions, we compared standard post-capture multiplexing to pre-capture multiplexing at two pool sizes (n = 12 or 16 samples per pool) in a large set of samples (n = 96). We demonstrate that pre-capture multiplexing offers significant reductions in cost and time while maintaining our minimum threshold for accurate variant detection.
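The scale of the potential saving can be illustrated by counting capture reactions, since pre-capture multiplexing needs one capture per pool rather than one per sample. The sketch below is illustrative arithmetic only: it assumes, hypothetically, that all 96 samples are processed at a single pool size, and the `capture_reactions` helper is not part of the study's methods. It counts reactions and does not model reagent prices or hands-on time.

```python
import math


def capture_reactions(n_samples, pool_size):
    """Number of capture reactions needed when samples are pooled before capture.

    A pool_size of 1 corresponds to post-capture multiplexing, in which
    every sample is captured individually.
    """
    return math.ceil(n_samples / pool_size)


n_samples = 96  # sample count taken from the study design

post_capture = capture_reactions(n_samples, pool_size=1)
pre_capture = {size: capture_reactions(n_samples, size) for size in (12, 16)}

print("post-capture:", post_capture)
for size, n_captures in pre_capture.items():
    print(f"pre-capture, {size} per pool:", n_captures)
```

Under these assumptions, 96 individual captures become 8 captures at 12 samples per pool or 6 captures at 16 samples per pool.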