Microdroplet PCR Workflow
Microdroplet technology is particularly well suited for processing DNA for massively parallel amplification of sequencing targets by enabling the routine preparation of 1.5 million separate PCR reactions from 20 microliters of template solution containing just 7.5 micrograms of genomic-DNA. Microdroplet PCR is achieved through the following steps (): picoliter volume droplets of fragmented genomic DNA template are merged with pre-made primer pair droplets (Primer Library), pooled thermal cycling of the resulting PCR reactions (Droplet PCR), and destabilizing the droplets to release the PCR product (Break Emulsion) for purification and sequencing.
A key component of the microdroplet PCR process is the generation of a high quality primer library. The forward and reverse primers designed to amplify a targeted genomic interval are combined at a concentration of 1.1 μM per primer (), and reformatted as 8 picoliter-volume primer pair droplets (encapsulation of the primer pairs in inert fluorinated carrier oil) by using a flow-focusing nozzle (Supplemental Fig. 1
to make a library element. Library elements are combined together such that each library element is represented by an equal number of droplets within the final primer library (). The resulting primer library is mixed and quality tested for primer pair droplet size and uniformity, as well as library element representation (Supplemental Fig. 2
The genomic DNA template mixture () contains all of the components for PCR except for the primers and is prepared by fragmenting genomic DNA using DNase I to produce 2-4 kb fragments () (Online Methods). On the microfluidic chip the template mixture is made into droplets and paired with primer pair droplets. The paired template and primer droplets enter the merge area on the microfluidic chip () at a rate of ~3000 droplets per second. As the primer pair droplets (8 pL) are smaller than the template droplets (14 pL) they move faster through the channels until they contact the preceding template droplet. Field induced coalescence of these droplet pairs16
results in the two droplets merging to produce a single PCR droplet which are collected and processed as an emulsion PCR reaction (Supplemental Fig. 3
Validation phase - targeted sequences and DNA samples
To validate the microdroplet PCR approach we selected 47 genes distributed across the genome to determine the extent to which a set of PCR primer pairs could be chosen to simultaneous amplify numerous loci with a broad spectrum of sequence compositions using a single set of reaction conditions (Supplemental Table 1
). Local DNA context, such as repetitive elements, GC content and sequence variants, can affect the ability of PCR primer pairs to work. One major drawback of PCR sample preparation is that allelic imbalanced amplification (greater efficiency of amplification for one of the alleles) due to variants in primer sites is sometimes observed17, 18
. This type of PCR bias results in variant base calling errors, specifically heterozygous sites are called homozygous. Although the majority of SNPs shared among all populations have been discovered19
and are taken into account during primer design, population specific and rare variants are largely unknown. To check for biases in the ability of primers designed to the reference genome to amplify genomic DNA from individuals of different ethnicities, we included three HapMap samples of European and three of African descent. We also selected ~60% of the targeted sequences from three ENCODE20
intervals that were sequenced in five of the HapMap samples in our study as part of the HapMap Consortium21
. This allowed us to compare the effect of allelic imbalanced amplification in intervals of the genome in which the majority of variation is known and can be accounted for during primer design versus intervals that are less well annotated for variation.
Of the 47 genes targeted in the validation phase, 29 are from ENCODE20
intervals, 8 are members of the Transient Receptor Potential ion (TRP) channel superfamily (included to test the ability to amplify and correctly align reads among family members with high nucleotide similarity), and the remaining 11 are candidates for deep venous thrombosis. Our primer design strategy (Online Methods) split the 435 exons of the 47 genes into 457 amplicons of varying sizes (119-956 bp) and GC content (24-78%) (Supplemental Tables 2 & 3
). The full amplicon sequences constituted 172.2 kb and the exonic sequences comprised 75.9 kb (44%). It is important to note that the primer selection process did not attempt to minimize off-exon sequence, and thus if desired the ratio of exon:off-exon sequences in the amplicons could be increased. The primer pairs were designed and put into production in a single pass; there was no redesign of failed primer pairs.
Specificity of amplification
Using the validation phase 457 PCR primer set we amplified the six DNA samples using both traditional singleplex and microdroplet PCR. After the PCR amplification step all samples from the two methods were processed simultaneously through library creation and sequencing to ensure that any differences in the data were due to the amplification methods. Samples were labeled with a DNA barcode during the library preparation step and for each PCR method all six bar-coded samples were run on a single lane (Online Methods). The two lanes generated equivalent amounts of sequence data with each sample having an average of 1.27 million reads passing quality filters ().
Illumina GAII reads and mapping statistics
To determine how well the 457 primer pairs specifically amplified the targeted sequences we calculated the percentage of filtered reads that mapped to a targeted amplicon. Considering all six samples 82% and 73% of filtered reads successfully mapped to the full amplicon sequences for the traditional and microdroplet reactions, respectively (). Considering only uniquely mapping reads 96% (231.69 Mb) of the traditional and 84% (202.15 Mb) of the microdroplet reads map back to the amplicons. After trimming primer sequences from the amplicons, which account for 12% of the total sequence, of the uniquely mapping reads 91% of the traditional and 80% of the microdroplet reads mapped to targeted sequences. Of the reads mapping to the amplicons, 55.6% and 51.1% map to the exons for the traditional and microdroplet PCR, respectively. In the microdroplet PCR reaction the ratio of input genomic DNA to amount of final PCR product is considerably greater than it is for traditional PCR (Online Methods), which may explain the ~15% difference between the two methods in reads that map off target. Overall, these data demonstrate that the microdroplet PCR enrichment method generates a high proportion of target sequence to background sequences.
Coverage uniformity and reproducibility
Uniformity of sequence coverage across targeted sequences is important because it determines the average depth to which samples have to be sequenced to have optimal sensitivity for variant calling. If the coverage differs greatly across targeted bases then one has to sequence deeply to adequately cover underrepresented bases. We initially examined how the base by base sequence coverage differs within individual amplicons. Visual inspection of coverage plots of individual targeted sequences () showed that there is typically a two to three-fold variation in the per-base sequence coverage depth across the amplicon. Interestingly, a noticeable dip in sequence coverage occurs at the transition from the 36th
base pair in the amplicon, which is likely due to more reads (36 bp in length) starting at base one compared to bases 2 to 5. We next compared the coverage uniformity across all targeted bases for the six samples and the two PCR methods. The six samples had slightly different sequence yields in the traditional and microdroplet PCR methods () and therefore we normalized the coverage (divided the coverage of each base by the mean coverage of all targeted bases) so we could directly compare their coverage distribution plots (). We first determined the percentage of targeted bases with sequence coverage between 1/5 to 5 times the mean. For traditional PCR, 92.8% of all bases fell within this range with 99.6% of the bases covered by at least one read (, Supplemental Table 4
). The microdroplet PCR showed similar results with 90.2% of all bases falling within 1/5 and 5 times the mean with 99.8% of all targeted bases covered by at least one read (). These results indicate that for studies using microdroplet PCR amplified DNA and Illumina GA sequencing to an average depth of 25X that ~ 90% of the bases will be covered with 5 or more reads and with ~98% of all bases covered by at least a single read.
Coverage plots of targeted sequences
Normalized Coverage Distribution Plots
The consistency and reproducibility of coverage across targeted bases from sample to sample is of high importance for population studies. This is due to the fact that coverage is directly correlated with accuracy in base calling and to perform sequence-based association studies the same bases need to be analyzed across numerous samples. We compared the mean coverage of each amplicon between samples showing that reproducibility was excellent between samples with high correlations of amplicon coverage within a PCR method (average Lin's concordance: r2 = 0.88 traditional PCR, r2 = 0.91 microdroplet PCR) (). On the other hand, coverage depth of specific amplicons between the two PCR methods varied (), most likely due to differences in PCR chemistry and cycling conditions. Interestingly, of the 457 amplicons there was only one that failed completely (no reads) in four samples with the traditional PCR and a different amplicon that failed completely in three individuals with the microdroplet PCR. In both PCR methods >99.6% of the amplicons were successful, defined as a mean coverage greater than 5. Coverage was similar for the European and African ancestry samples and for the targeted sequences in ENCODE and non-ENCODE intervals. These data demonstrate that microdroplet PCR results in consistent and reproducible coverage of targeted sequences.
Inter-sample reproducibility of amplicon coverage
Accuracy of sequence variant calls
We evaluated the accuracy of variant bases called in the PCR amplified sequences for the six samples by comparison to HapMap genotypes for ~450 SNPs (). There were a total of 2424 comparisons for the tradition PCR across the six samples with 22 of the comparisons being discordant for a consensus rate of 99.1% (). The microdroplet PCR had the same consensus rate with 22 discordances out of 2390 comparisons. In both PCR methods ~ 2.5% of the HapMap variant sites were uncalled primarily due to low coverage, poor mapping quality, or the presence of neighboring variants. There was no observed difference in call rate or concordance between either the ENCODE and non-ENCODE intervals or between samples in the two ethnicity groups. Importantly, the two PCR methods were concordant with each other at all positions discordant with the HapMap. Of the 22 loci discordant with HapMap the errors fell into three classes: 1. nine are homozygous reference genotypes with HapMap homozygous alternate; 2. five are heterozygous genotypes with HapMap homozygous reference; and 3. eight are homozygous genotypes with HapMap heterozygous (Supplemental Table 5
). Further examination of the discordant bases revealed most of those in class 1 are present in a single targeted sequence with a highly similar homolog in the genome. Our sequencing reports the reference alleles for the target while HapMap calls the alternate alleles, which correspond to the fixed bases at those positions in the homolog. Inspection of the ENCODE traces revealed that the amplicons appeared to be amplifying both homologs as supported by all individuals being heterozygous at the differing positions (Supplemental Fig. 4
). Class 2 and 3 errors most likely represent missed heterozygous calls in the HapMap and sequence data, respectively. Because allelic imbalanced amplification is a concern when performing PCR enrichment we focused on the eight Class 3 errors. Manual inspections of ENCODE sequence traces for five of these variants revealed that 4 are homozygous calls consistent with our sequence data. To determine if allelic bias during the PCR reaction was common among concordant variant calls we calculated the number of reference and alternate allele observations at heterozygote sites in the six samples. No skew was observed with 51.6% (CV 12%) and 51.1% (CV 13%) of the reads reporting the reference allele in traditional and microdroplet, respectively with no difference between ENCODE and non-ENCODE sites. Our analysis indicates that allele biased amplification results in only ~0.1% of the variants to be called incorrectly (≤ 5 out of 2390) and that >50% of the discrepancies between HapMap genotypes and variant calls in our sequence data are due to errors in the HapMap data.
Validation phase sequence variant detection rates and concordance
To estimate the false positive rate we analyzed the variants identified in the targeted genes in the ENCODE intervals sequenced in five of the samples as part of the HapMap project. A total of 1122 variants at 442 positions were identified in these genes, of which 62 positions were not listed as variants in the HapMap data (Supplemental Table 6
). Thirty-five of these positions are listed as variant in dbSNP suggesting that they are not false positives. Of the 27 positions not listed in dbSNP, 12 have concordant calls between the traditional and microdroplet PCR for at least one sample suggesting they may be true novel variants. The remaining 15 sites were either discrepant in the traditional but not in the microdroplet PCR (1 position, 3 variant calls) or discrepant by one method and the position did not pass quality filters in the second method (14 positions, 15 variant calls). These analyses suggest that the false positive rate of the sequence data is < 1.6% (≤18 of 1122 variant calls).
Scale up phase – selection of target sequences
To efficiently perform population-based sequencing studies using next generation sequencing platforms it is important to be able to simultaneously examine large numbers of targeted sequences. In this next phase we tested the ability of the microdroplet PCR to scale-up to 3976 primer pairs, which in total amplify 1.49 Mb of sequence, of which 645 kb (43%) consists of exonic sequence. We also tested the compatibility of this enrichment process with other next generation technologies (Online Methods). To further test the robustness of the microdroplet PCR process we choose primer pairs with widely varying characteristics, including primer lengths (17 bases - 30 bases), primer Tm (55.4°C - 61.2°C), amplicon length (299 bp - 659 bp) and amplicon GC content (25.1% - 81.5%), to determine how this affected the outcome of targeted sequence amplification.
Specificity of amplification
A single HapMap sample, NA18858, was amplified using the scale-up phase 3976 amplicons by microdroplet PCR (Online Methods). The resulting amplified material was then divided into two aliquots and sequenced by both the Illumina GA II (short-read platform) and Roche 454 FLX (long read platform). The Illumina run generated >10 million quality single-end reads with similar fractions, as those observed in the validation phase, of both total reads (76%) and uniquely mapped reads (79%) successfully mapped to the trimmed amplicon sequences (). The Roche 454 run generated ~350,000 quality reads with ~94% of these mapping to the full amplicons. The larger fraction of Roche reads mapping to targets is likely due to the fact that the microfluidic PCR amplified DNA is not fragmented in the Roche 454 library preparation protocol and therefore the non-specific genomic DNA carryover from the amplification step (~2 to 4 kb in length) is too long to be efficiently used in library generation. These results indicate that even when simultaneously amplifying 3976 targeted sequences microfluidic PCR generates a high ratio of target sequence to background sequence.
Coverage uniformity and accuracy of variant calls
We evaluated the ability to uniformly capture the intended target across all 3976 amplicons. Only one amplicon failed completely resulting in zero mapping reads for both the Illumina and Roche 454 platforms. After trimming primer sequences we enriched for a total of 1.35 Mb of sequence, on the Illumina platform 99.8% of the bases were covered by 1 or more reads, of which 96.6% fell within 1/5 and 5 times the mean coverage (). Furthermore, 90% of all amplicons had coverage at 1/5 the mean or greater for at least 95% of the basis, for 5X coverage this increased to 98% of the amplicons. Coverage was lower for amplicons greater than 70% GC content but was unaffected by amplicon length (Supplemental Fig. 5
). The proportion of reads mapping to exons is closely proportional to the fraction of exonic sequences in the amplicons. The uniformity of coverage was slightly decreased when sequenced on the Roche 454 platform, with 93.7% of the bases covered by at least 1 read and 88.4% falling within 1/5 and 5 times the mean (). The lower coverage uniformity observed in the Roche 454 sequence results from sequencing the amplicon ends without shearing the input DNA in the library preparation, and therefore the middle bases of short amplicons have high coverage and the middle bases of longer amplicons receive little or no coverage ().
To examine accuracy, variant bases called in the Illumina and Roche 454 sequences were compared to 2226 HapMap genotypes. The Illumina sequence had a call rate of 99.3% with 26 discordant SNPs, while the Roche 454 sequence had a call rate of 92.3% with 31 discordant SNPs. The discrepant variants overlapped at 21 sites between the two sequence platforms and largely fell into the same error classes observed in the validation phase, with the exception that several discrepant Roche 454 SNPs were present in homopolymer sequences (Supplemental Table 7
) consistent with previous observations.18
The Roche 454 lower call rate is related to the low coverage bases in the middle of longer amplicons (). Overall this data shows that increasing the number of amplicons in the microdroplet PCR reaction almost an order of magnitude does not affect the resulting data quality.